Offstrategy compaction on TWCS tables triggers a lot of reactor stalls #9199

Closed
aleksbykov opened this issue Aug 12, 2021 · 32 comments

@aleksbykov
Contributor

Installation details
Scylla version (or git commit hash): Scylla version 4.6.dev-0.20210728.42e1f318d with build-id d23dd7c84a396ba1cdf9a93e82a05cb944e5c234
Cluster size: 3 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0a67e737c28ecaa63 (eu-north-1)

instance type: i3.2xlarge

I used two stress tools to load the cluster with time-series data:
scylla-bench:

scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=2000000 -clustering-row-count=100000 -clustering-row-size=uniform:1024..4096 -concurrency=30 -max-rate=60000 -rows-per-request=500 -duration=30m -error-at-row-limit 1000 -nodes 10.0.2.2,10.0.1.67 -keyspace scylla_bench_1 -table test1

scylla-bench -workload=timeseries -mode=read -replication-factor=3 -partition-count=2000000 -clustering-row-count=100000 -clustering-row-size=uniform:1024..4096 -write-rate=3 -concurrency=30 -max-rate=60000 -rows-per-request=500 -duration=60m -error-at-row-limit 1000 -provide-upper-bound -start-timestamp=1628670582155215229 -nodes 10.0.2.2,10.0.1.67

and this cassandra-stress command:

cassandra-stress user profile=/tmp/twcs_pr.yaml duration=120m ops'(ins=1)' cl=QUORUM -node 10.0.2.2,10.0.1.67 -port jmx=6868 -mode cql3 native -rate threads=100 

cassandra-stress used the following profile:

keyspace: keyspace_twcs
keyspace_definition: |
  CREATE KEYSPACE keyspace_twcs WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

table: twcs_cf

table_definition: |
  CREATE TABLE twcs_cf (
      key blob,
      key1 blob,
      key2 blob,
      key3 blob,
      time_ck timestamp,
      time timestamp,
      c0 blob,
      c1 blob,
      c2 blob,
      c3 blob,
      c4 blob,
      c5 blob,
      c6 blob,
      c7 blob,
      c8 blob,
      c9 blob,
      PRIMARY KEY ((key,key1),time_ck)
  ) WITH CLUSTERING ORDER BY (time_ck DESC)
      AND compaction = { 'class':'TimeWindowCompactionStrategy', 'compaction_window_unit':'MINUTES', 'compaction_window_size':'5' }

columnspec:
  - name: key
    size: fixed(10)
    population: uniform(1..100M)

  - name: key1
    size: fixed(10)
    population: uniform(1..100M)

  - name: key2
    size: fixed(30)
    cluster: uniform(1..1000)

  - name: key3
    size: fixed(20)
    population: uniform(1..1000)

  - name: time_ck
    cluster: fixed(288)

  - name: time
    cluster: fixed(288)

  - name: c0
    size: fixed(128)

  - name: c1
    size: fixed(128)

  - name: c2
    size: fixed(128)

  - name: c3
    size: fixed(128)

  - name: c4
    size: fixed(128)

  - name: c5
    size: fixed(128)

  - name: c6
    size: fixed(128)

  - name: c7
    size: fixed(128)

  - name: c8
    size: fixed(256)

  - name: c9
    size: fixed(512)


queries:
  ins:
    cql: insert into keyspace_twcs.twcs_cf (key, key1, key2, key3, time_ck, time, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9) VALUES (?,?,?,?, toTimestamp(now()),toTimestamp(now()), ?,?,?,?,?,?,?,?,?,?)

The following schemas were created:

CREATE KEYSPACE keyspace_twcs WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}  AND durable_writes = true;

CREATE TABLE keyspace_twcs.twcs_cf (
    key blob,
    key1 blob,
    time_ck timestamp,
    c0 blob,
    c1 blob,
    c2 blob,
    c3 blob,
    c4 blob,
    c5 blob,
    c6 blob,
    c7 blob,
    c8 blob,
    c9 blob,
    key2 blob,
    key3 blob,
    time timestamp,
    PRIMARY KEY ((key, key1), time_ck)
) WITH CLUSTERING ORDER BY (time_ck DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '5', 'compaction_window_unit': 'MINUTES'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 1
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
CREATE KEYSPACE scylla_bench_1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}  AND durable_writes = true;

CREATE TABLE scylla_bench_1.test1 (
    pk bigint,
    ck bigint,
    v blob,
    PRIMARY KEY (pk, ck)
) WITH CLUSTERING ORDER BY (ck ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'MINUTES'}
    AND compression = {}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 150
    AND gc_grace_seconds = 300
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CREATE TABLE scylla_bench_1.test_counters (
    pk bigint,
    ck bigint,
    c1 counter,
    c2 counter,
    c3 counter,
    c4 counter,
    c5 counter,
    PRIMARY KEY (pk, ck)
) WITH CLUSTERING ORDER BY (ck ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

During the load, I stopped scylla-server on node3, removed several sstables, and started Scylla again. After Scylla was up and nodetool status reported all nodes as UN, I ran nodetool scrub and terminated it with Ctrl+C. Right after that I ran nodetool repair.

scyllaadm@longevity-twcs-3h-master-db-node-8d71d74e-1:~$ nodetool scrub
WARN  14:16:56,206 Only 12.551GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots
^C^C
scyllaadm@longevity-twcs-3h-master-db-node-8d71d74e-1:~$ ^C
scyllaadm@longevity-twcs-3h-master-db-node-8d71d74e-1:~$ ^C
scyllaadm@longevity-twcs-3h-master-db-node-8d71d74e-1:~$ ^C
scyllaadm@longevity-twcs-3h-master-db-node-8d71d74e-1:~$ nodetool repair
[2021-08-11 14:17:21,424] Starting repair command #7, repairing 1 ranges for keyspace system_distributed_everywhere (parallelism=SEQUENTIAL, full=true)
[2021-08-11 14:17:27,489] Repair session 7
[2021-08-11 14:17:27,537] Repair session 7 finished
[2021-08-11 14:17:27,613] Starting repair command #8, repairing 1 ranges for keyspace scylla_bench (parallelism=SEQUENTIAL, full=true)
[2021-08-11 14:22:08,694] Repair session 8
[2021-08-11 14:22:08,781] Repair session 8 finished
[2021-08-11 14:31:39,706] Repair session 9
[2021-08-11 14:31:39,712] Repair session 9 finished
[2021-08-11 14:31:40,045] Starting repair command #10, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2021-08-11 14:31:48,220] Repair session 10
[2021-08-11 14:31:48,232] Repair session 10 finished
[2021-08-11 14:31:48,409] Starting repair command #11, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)
[2021-08-11 14:31:55,536] Repair session 11
[2021-08-11 14:31:55,564] Repair session 11 finished

But in the logs of the nodes (node1 and node3), within several minutes I got reactor stalls of over 130 ms, such as:

Aug 11 15:08:12 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]: Reactor stalled for 134 ms on shard 6. Backtrace: 0x40e7d74 0x40e70b1 0x40e7c3f 0x7febf3ec21df 0x181edbc 0x181e40a 0x182036e 0x180e8b7 0x1806f93 0x10ebf60 0x10ec9dd 0x119013f 0x12e03c1 0x12e2591 0x40faccf 0x40fc037 0x4119fa5 0x40d3b1a 0x93f8 0x101902
Aug 11 15:08:36 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]: Reactor stalled for 132 ms on shard 4. Backtrace: 0x40e7d74 0x40e70b1 0x40e7c3f 0x7febf3ec21df 0x40c2833 0x40c2a36 0x40c5a11 0x181f955 0x181f9f6 0x181f9f6 0x181f762 0x181f9f6 0x180c6b8 0x1805fde 0x10ebf3b 0x10ec9dd 0x119013f 0x12e03c1 0x12e2591 0x40faccf 0x40fc037 0x4119fa5 0x40d3b1a 0x93f8 0x101902
Aug 11 15:08:36 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]: Reactor stalled for 132 ms on shard 7. Backtrace: 0x40e7d74 0x40e70b1 0x40e7c3f 0x7febf3ec21df 0x40c26e5 0x40c2a36 0x40c5a11 0x181d81d 0x180c8d1 0x1805fde 0x10ebf3b 0x10ec9dd 0x119013f 0x12e03c1 0x12e2591 0x40faccf 0x40fc037 0x4119fa5 0x40d3b1a 0x93f8 0x101902

Aug 11 15:08:38 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]: Reactor stalled for 136 ms on shard 6. Backtrace: 0x40e7d74 0x40e70b1 0x40e7c3f 0x7fdfd64ac1df 0x40c26e5 0x40c2a36 0x40c5a11 0x181d81d 0x180c8d1 0x1805fde 0x10ebf3b 0x10ec9dd 0x119013f 0x12e03c1 0x12e2591 0x40faccf 0x40fc037 0x4119fa5 0x40d3b1a 0x93f8 0x101902
Aug 11 15:08:38 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] compaction - [Scrub scylla_bench_1.test1 93bfb090-fab5-11eb-9c57-bb1afa651908] Finished scrubbing in abort mode 1 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-232000-big-Data.db:level=0]. 427MB to 12MB (~3% of original) in 184411ms = 70kB/s. ~163072 total partitions merged to 163040.
Aug 11 15:08:38 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] compaction - [Scrub scylla_bench_1.test1 01bce2c0-fab6-11eb-9c57-bb1afa651908] Scrubbing in abort mode [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-47144-big-Data.db:level=0:origin=compaction]

The scrub process continued to run in the background, in parallel with the repair.

I decoded several of the reactor stalls:

Aug 11 16:53:38 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]: Reactor stalled for 265 ms on shard 0. Backtrace: 0x40e7d74 0x40e70b1 0x40e7c3f 0x7fdfd64ac1df 0x40c0c71 0x40c118b 0x40c14b8 0x40c2741 0x40c2a36 0x40c5a11 0x16574e7 0x1664a8d 0x1805149 0x11e1956 0x40faccf 0x40fc037 0x40fb30c 0x40a52b0 0x40a46e6 0xeceef8 0x281e1 0xecc00d


void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:768
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:787
seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1221
seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1102
 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1116
 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1204
?? ??:0
seastar::memory::cpu_pages::find_and_unlink_span_reclaiming(unsigned int) at memory.cc:?
 (inlined by) seastar::memory::cpu_pages::unlink(seastar::memory::page_list&, seastar::memory::page*) at ./build/release/seastar/./seastar/src/core/memory.cc:617
 (inlined by) seastar::memory::cpu_pages::find_and_unlink_span(unsigned int) at ./build/release/seastar/./seastar/src/core/memory.cc:691
 (inlined by) seastar::memory::cpu_pages::find_and_unlink_span_reclaiming(unsigned int) at ./build/release/seastar/./seastar/src/core/memory.cc:698
seastar::memory::cpu_pages::allocate_large_and_trim(unsigned int) at ./build/release/seastar/./seastar/src/core/memory.cc:727
seastar::memory::cpu_pages::allocate_large(unsigned int) at ./build/release/seastar/./seastar/src/core/memory.cc:773
seastar::memory::small_pool::add_more_objects() at ./build/release/seastar/./seastar/src/core/memory.cc:1260
seastar::memory::small_pool::allocate() at ./build/release/seastar/./seastar/src/core/memory.cc:1220
 (inlined by) seastar::memory::cpu_pages::allocate_small(unsigned int) at ./build/release/seastar/./seastar/src/core/memory.cc:833
 (inlined by) seastar::memory::allocate(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1383
operator new(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:2023
void seastar::future<void>::schedule<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(auto:1&&)#1}>({lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}>(sstables::sstable::close_files()::$_68&&, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(auto:1&&)#1}>({lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}&&) at ././seastar/include/seastar/core/future.hh:1381
 (inlined by) seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}&&) at ././seastar/include/seastar/core/future.hh:1666
 (inlined by) seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_common<false, seastar::future<void>, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}&&) at ././seastar/include/seastar/core/future.hh:1691
 (inlined by) seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_maybe_erase<false, seastar::future<void>, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}&&) at ././seastar/include/seastar/core/future.hh:1644
 (inlined by) seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped<seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}, seastar::future<void> >(sstables::sstable::close_files()::$_68&&) & at ././seastar/include/seastar/core/future.hh:1628
 (inlined by) seastar::future<void> seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&) at ././seastar/include/seastar/core/future.hh:1831
 (inlined by) sstables::sstable::close_files() at ./sstables/sstables.cc:2367
sstables::sstable::destroy() at ./sstables/sstables.cc:3032
sstables::sstables_manager::deactivate(sstables::sstable*) at ./sstables/sstables_manager.cc:82
seastar::internal::lw_shared_ptr_accessors<sstables::sstable, void>::dispose(seastar::lw_shared_ptr_counter_base*) at ././seastar/include/seastar/core/shared_ptr.hh:244
 (inlined by) ~lw_shared_ptr at ././seastar/include/seastar/core/shared_ptr.hh:301
 (inlined by) void std::destroy_at<seastar::lw_shared_ptr<sstables::sstable> >(seastar::lw_shared_ptr<sstables::sstable>*) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/stl_construct.h:88
 (inlined by) void std::_Destroy<seastar::lw_shared_ptr<sstables::sstable> >(seastar::lw_shared_ptr<sstables::sstable>*) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/stl_construct.h:138
 (inlined by) void std::_Destroy_aux<false>::__destroy<seastar::lw_shared_ptr<sstables::sstable>*>(seastar::lw_shared_ptr<sstables::sstable>*, seastar::lw_shared_ptr<sstables::sstable>*) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/stl_construct.h:152
 (inlined by) void std::_Destroy<seastar::lw_shared_ptr<sstables::sstable>*>(seastar::lw_shared_ptr<sstables::sstable>*, seastar::lw_shared_ptr<sstables::sstable>*) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/stl_construct.h:184
 (inlined by) void std::_Destroy<seastar::lw_shared_ptr<sstables::sstable>*, seastar::lw_shared_ptr<sstables::sstable> >(seastar::lw_shared_ptr<sstables::sstable>*, seastar::lw_shared_ptr<sstables::sstable>*, std::allocator<seastar::lw_shared_ptr<sstables::sstable> >&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/alloc_traits.h:738
 (inlined by) ~vector at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/stl_vector.h:680
 (inlined by) table::run_offstrategy_compaction() at ./table.cc:1022
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2230
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2639
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2808
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2691
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:217
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:125
main at ./main.cc:511
?? ??:0
_start at ??:?




 Aug 11 14:48:27 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]: Reactor stalled for 134 ms on shard 6. 0x40e7d74 0x40e70b1 0x40e7c3f 0x7febf3ec21df 0x181d782 0x180c8d1 0x1805fde 0x10ebf3b 0x10ec9dd 0x119013f 0x12e03c1 0x12e2591 0x40faccf 0x40fc037 0x4119fa5 0x40d3b1a 0x93f8 0x101902


addr2line -Cpife /usr/lib/debug/.build-id/d2/3dd7c84a396ba1cdf9a93e82a05cb944e5c234.debug  0x40e7d74 0x40e70b1 0x40e7c3f 0x7febf3ec21df 0x181d782 0x180c8d1 0x1805fde 0x10ebf3b 0x10ec9dd 0x119013f 0x12e03c1 0x12e2591 0x40faccf 0x40fc037 0x4119fa5 0x40d3b1a 0x93f8 0x101902
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:768
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:787
seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1221
seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1102
 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1116
 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1204
?? ??:0
void std::_Hashtable<utils::UUID, std::pair<utils::UUID const, sstables::sstable_run>, std::allocator<std::pair<utils::UUID const, sstables::sstable_run> >, std::__detail::_Select1st, std::equal_to<utils::UUID>, std::hash<utils::UUID>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_assign<std::_Hashtable<utils::UUID, std::pair<utils::UUID const, sstables::sstable_run>, std::allocator<std::pair<utils::UUID const, sstables::sstable_run> >, std::__detail::_Select1st, std::equal_to<utils::UUID>, std::hash<utils::UUID>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> > const&, std::__detail::_AllocNode<std::allocator<std::__detail::_Hash_node<std::pair<utils::UUID const, sstables::sstable_run>, true> > > >(std::_Hashtable<utils::UUID, std::pair<utils::UUID const, sstables::sstable_run>, std::allocator<std::pair<utils::UUID const, sstables::sstable_run> >, std::__detail::_Select1st, std::equal_to<utils::UUID>, std::hash<utils::UUID>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> > const&, std::__detail::_AllocNode<std::allocator<std::__detail::_Hash_node<std::pair<utils::UUID const, sstables::sstable_run>, true> > > const&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/hashtable.h:1179
_Hashtable at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/hashtable.h:1279
 (inlined by) unordered_map at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/unordered_map.h:181
 (inlined by) partitioned_sstable_set at ./sstables/sstable_set.cc:242
 (inlined by) std::_MakeUniq<sstables::partitioned_sstable_set>::__single_object std::make_unique<sstables::partitioned_sstable_set, seastar::lw_shared_ptr<schema const> const&, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > const&, boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sstable> >, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >, boost::icl::partial_absorber, std::less, boost::icl::inplace_plus, boost::icl::inter_section, boost::icl::continuous_interval<compatible_ring_position_or_view, std::less>, std::allocator> const&, seastar::lw_shared_ptr<std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sstable> >, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > > const&, std::unordered_map<utils::UUID, sstables::sstable_run, std::hash<utils::UUID>, std::equal_to<utils::UUID>, std::allocator<std::pair<utils::UUID const, sstables::sstable_run> > > const&, bool const&>(seastar::lw_shared_ptr<schema const> const&, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > const&, boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sstable> >, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >, boost::icl::partial_absorber, std::less, boost::icl::inplace_plus, boost::icl::inter_section, boost::icl::continuous_interval<compatible_ring_position_or_view, std::less>, std::allocator> const&, seastar::lw_shared_ptr<std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sstable> >, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > > const&, std::unordered_map<utils::UUID, sstables::sstable_run, std::hash<utils::UUID>, std::equal_to<utils::UUID>, std::allocator<std::pair<utils::UUID const, sstables::sstable_run> > > const&, bool const&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/unique_ptr.h:962
 (inlined by) sstables::partitioned_sstable_set::clone() const at ./sstables/sstable_set.cc:247
sstable_set at ./sstables/sstable_set.cc:76
seastar::lw_shared_ptr<sstables::sstable_set> seastar::lw_shared_ptr<sstables::sstable_set>::make<sstables::sstable_set&>(sstables::sstable_set&) at ././seastar/include/seastar/core/shared_ptr.hh:267
 (inlined by) seastar::lw_shared_ptr<sstables::sstable_set> seastar::make_lw_shared<sstables::sstable_set>(sstables::sstable_set&) at ././seastar/include/seastar/core/shared_ptr.hh:429
 (inlined by) table::do_add_sstable(seastar::lw_shared_ptr<sstables::sstable_set>, seastar::lw_shared_ptr<sstables::sstable>, seastar::bool_class<enable_backlog_tracker_tag>) at ./table.cc:345
table::add_maintenance_sstable(seastar::lw_shared_ptr<sstables::sstable>) at ./table.cc:364
operator() at ./table.cc:397
 (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<table::add_sstable_and_update_cache(seastar::lw_shared_ptr<sstables::sstable>, seastar::bool_class<sstables::offstrategy_tag>)::$_4>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:124
row_cache::external_updater::execute() at ././row_cache.hh:213
 (inlined by) operator() at ./row_cache.cc:1331
seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}>(row_cache::external_updater&&) at ././seastar/include/seastar/core/future.hh:2135
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}>(row_cache::external_updater&&, seastar::internal::monostate) at ././seastar/include/seastar/core/future.hh:1979
 (inlined by) seastar::future<void> seastar::future<void>::then_impl<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, seastar::future<void> >(row_cache::external_updater&&) at ././seastar/include/seastar/core/future.hh:1601
 (inlined by) seastar::internal::future_result<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, void>::future_type seastar::internal::call_then_impl<seastar::future<void> >::run<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}>(seastar::future<void>&, row_cache::external_updater&&) at ././seastar/include/seastar/core/future.hh:1234
 (inlined by) seastar::future<void> seastar::future<void>::then<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, seastar::future<void> >(row_cache::external_updater&&) at ././seastar/include/seastar/core/future.hh:1520
 (inlined by) operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> > at ./row_cache.cc:1329
 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(std::__invoke_other, std::function<seastar::future<void> ()>&&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60
 (inlined by) std::__invoke_result<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >::type std::__invoke<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(row_cache::external_updater&&, (std::__invoke_result&&)...) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:95
 (inlined by) std::invoke_result<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >::type std::invoke<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(row_cache::external_updater&&, (std::invoke_result&&)...) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/functional:88
 (inlined by) auto seastar::internal::future_invoke<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(row_cache::external_updater&&, std::function<seastar::future<void> ()>&&) at ././seastar/include/seastar/core/future.hh:1211
 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1582
 (inlined by) void seastar::futurize<seastar::future<void> >::satisfy_with_result_of<seastar::future<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >::then_impl_nrvo<row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}, seastar::future<void> >(row_cache::external_updater&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(auto:1)#1}&, seastar::future_state<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >&&)#1}::operator()(seastar::internal::promise_base_with_type<void>, seastar::internal::promise_base_with_type<void>&&, seastar::future_state<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}) at ././seastar/include/seastar/core/future.hh:2120
 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1575
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_29::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}, seastar::future<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >::then_impl_nrvo<{lambda(auto:1)#1}, seastar::future<void> >(row_cache::external_updater&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(auto:1)#1}&, seastar::future_state<std::chrono::_V2::steady_clock>&&)#1}, std::chrono::_V2::steady_clock>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2230
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2639
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2808
operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:3995
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(std::__invoke_other, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>, void>::type std::__invoke_r<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:110
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:291
std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:622
 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:60
?? ??:0
?? ??:0

Current nodetool status:

scyllaadm@longevity-twcs-3h-master-db-node-8d71d74e-1:~$ nodetool status
Datacenter: eu-north
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns    Host ID                               Rack
UN  10.0.2.2   238.96 GB  256          ?       b25ac5a3-a25a-4000-8307-f21166eb0a6d  1a
UN  10.0.1.67  237.12 GB  256          ?       c87ee842-7310-4a9e-b182-1da29b48a787  1a
?N  10.0.2.28  38.99 GB   256          ?       0b15b664-9f61-468f-b03f-72c7276fe368  1a

All nodes are available

| longevity-twcs-3h-master-db-node-8d71d74e-1      | eu-north-1a | 13.48.5.162   | 8d71d74e-55cc-4c51-8f39-8ccfdae88a7d | alex.bykov | Tue Aug 10 14:37:34 2021 |
| longevity-twcs-3h-master-db-node-8d71d74e-3      | eu-north-1a | 13.48.70.38   | 8d71d74e-55cc-4c51-8f39-8ccfdae88a7d | alex.bykov | Tue Aug 10 14:37:34 2021 |
| longevity-twcs-3h-master-loader-node-8d71d74e-1  | eu-north-1a | 13.49.46.162  | 8d71d74e-55cc-4c51-8f39-8ccfdae88a7d | alex.bykov | Tue Aug 10 14:38:17 2021 |
| longevity-twcs-3h-master-monitor-node-8d71d74e-1 | eu-north-1a | 13.49.77.114  | 8d71d74e-55cc-4c51-8f39-8ccfdae88a7d | alex.bykov | Tue Aug 10 14:38:42 2021 |
| longevity-twcs-3h-master-db-node-8d71d74e-4      | eu-north-1a | 13.48.130.237 | 8d71d74e-55cc-4c51-8f39-8ccfdae88a7d | alex.bykov | Tue Aug 10 15:39:20 2021 |

Db logs: https://cloudius-jenkins-test.s3.amazonaws.com/8d71d74e-55cc-4c51-8f39-8ccfdae88a7d/20210812_082856/db-cluster-8d71d74e.tar.gz

@aleksbykov
Contributor Author

Probably related to the series of issues: #9053, #9054, #8969, #8808, #8877.

@avikivity
Member

I think it's related to 'compaction_window_unit':'MINUTES', 'compaction_window_size':'5'. In the first stall, we're releasing a vector of sstables. In the second stall, we're cloning an sstable_set. Both of these should behave reasonably with a small number of sstables, but can be bad if the number of sstables is huge.

However, the workload ran for 120 minutes, and 120/5 = 24 windows; that is not a huge number and should not cause a stall.

@avikivity
Member

$ grep Reshaped */system.log | sort -n -k 14 | tail -n 10
longevity-twcs-3h-master-db-node-8d71d74e-1/system.log:Aug 11 08:59:43 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[9880]:  [shard 3] compaction - [Reshape scylla_bench_1.test1 767120b0-fa82-11eb-afa2-37957b9762d6] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35195-big-Data.db:level=0]. 14MB to 9MB (~61% of original) in 3484ms = 2MB/s. ~202496 total partitions merged to 154260.
longevity-twcs-3h-master-db-node-8d71d74e-1/system.log:Aug 11 08:59:53 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[9880]:  [shard 3] compaction - [Reshape scylla_bench_1.test1 788867f0-fa82-11eb-afa2-37957b9762d6] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35203-big-Data.db:level=0]. 414MB to 12MB (~3% of original) in 9743ms = 1MB/s. ~203904 total partitions merged to 155948.
longevity-twcs-3h-master-db-node-8d71d74e-3/system.log:Aug 11 08:59:42 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]:  [shard 2] compaction - [Reshape scylla_bench_1.test1 76736aa0-fa82-11eb-a4c3-152fdfe09b31] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-17666-big-Data.db:level=0]. 28MB to 3MB (~11% of original) in 2091ms = 1MB/s. ~106240 total partitions merged to 47515.
longevity-twcs-3h-master-db-node-8d71d74e-3/system.log:Aug 11 08:59:42 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]:  [shard 3] compaction - [Reshape scylla_bench_1.test1 767406e0-fa82-11eb-b3a3-152ddfe09b31] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35299-big-Data.db:level=0]. 14MB to 9MB (~61% of original) in 1947ms = 4MB/s. ~202496 total partitions merged to 154260.
longevity-twcs-3h-master-db-node-8d71d74e-3/system.log:Aug 11 08:59:48 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]:  [shard 3] compaction - [Reshape scylla_bench_1.test1 77a16350-fa82-11eb-b3a3-152ddfe09b31] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35307-big-Data.db:level=0]. 414MB to 12MB (~3% of original) in 6479ms = 1MB/s. ~203904 total partitions merged to 155948.
longevity-twcs-3h-master-db-node-8d71d74e-4/system.log:Aug 11 08:59:45 longevity-twcs-3h-master-db-node-8d71d74e-4 scylla[758]:  [shard 2] compaction - [Reshape scylla_bench_1.test1 76720b10-fa82-11eb-8d23-ef699da595fa] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-17602-big-Data.db:level=0]. 128MB to 3MB (~3% of original) in 4770ms = 816kB/s. ~106240 total partitions merged to 47515.
longevity-twcs-3h-master-db-node-8d71d74e-4/system.log:Aug 11 08:59:45 longevity-twcs-3h-master-db-node-8d71d74e-4 scylla[758]:  [shard 3] compaction - [Reshape scylla_bench_1.test1 76725930-fa82-11eb-9374-ef6c9da595fa] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35155-big-Data.db:level=0]. 410MB to 12MB (~3% of original) in 5051ms = 2MB/s. ~202496 total partitions merged to 154260.
longevity-twcs-3h-master-db-node-8d71d74e-4/system.log:Aug 11 08:59:48 longevity-twcs-3h-master-db-node-8d71d74e-4 scylla[758]:  [shard 3] compaction - [Reshape scylla_bench_1.test1 7978bb60-fa82-11eb-9374-ef6c9da595fa] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35163-big-Data.db:level=0]. 15MB to 9MB (~61% of original) in 3098ms = 3MB/s. ~203904 total partitions merged to 155948.
longevity-twcs-3h-master-db-node-8d71d74e-3/system.log:Aug 11 14:36:32 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]:  [shard 2] compaction - [Reshape keyspace_twcs.twcs_cf 7f64f410-fab1-11eb-a4c3-152fdfe09b31] Reshaped 729 sstables to [/var/lib/scylla/data/keyspace_twcs/twcs_cf-2bea7b70fa7f11eb8addef6b9da595fa/md-89546-big-Data.db:level=0]. 339MB to 18MB (~5% of original) in 10785ms = 1MB/s. ~266496 total partitions merged to 217681.
longevity-twcs-3h-master-db-node-8d71d74e-4/system.log:Aug 11 14:36:23 longevity-twcs-3h-master-db-node-8d71d74e-4 scylla[758]:  [shard 2] compaction - [Reshape keyspace_twcs.twcs_cf 7a437a10-fab1-11eb-8d23-ef699da595fa] Reshaped 729 sstables to [/var/lib/scylla/data/keyspace_twcs/twcs_cf-2bea7b70fa7f11eb8addef6b9da595fa/md-33610-big-Data.db:level=0]. 375MB to 16MB (~4% of original) in 10422ms = 1MB/s. ~247936 total partitions merged to 199159.

So at most 729 sstables were reshaped; that shouldn't cause such huge stalls.

@avikivity
Member

/cc @raphaelsc

@raphaelsc
Member

Up to ~50k files per shard, which explains the stalls; see:

Aug 11 09:23:35 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[9880]:  [shard 0] table - Starting off-strategy compaction for scylla_bench.test, 27730 candidates were found
Aug 11 09:24:19 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[9880]:  [shard 0] table - Done with off-strategy compaction for scylla_bench.test
Aug 11 09:24:48 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[9880]:  [shard 0] table - Starting off-strategy compaction for keyspace_twcs.twcs_cf, 681 candidates were found
Aug 11 09:28:50 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[9880]:  [shard 0] table - Done with off-strategy compaction for keyspace_twcs.twcs_cf
Aug 11 10:55:01 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for scylla_bench_1.test1, 731 candidates were found
Aug 11 10:55:03 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Done with off-strategy compaction for scylla_bench_1.test1
Aug 11 10:57:59 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for keyspace_twcs.twcs_cf, 2476 candidates were found
Aug 11 10:58:04 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Done with off-strategy compaction for keyspace_twcs.twcs_cf
Aug 11 14:36:10 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for keyspace_twcs.twcs_cf, 7847 candidates were found
Aug 11 14:37:47 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Done with off-strategy compaction for keyspace_twcs.twcs_cf
Aug 11 16:45:16 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for scylla_bench_1.test1, 48954 candidates were found
Aug 11 16:45:25 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] compaction_manager - off-strategy compaction failed due to std::filesystem::__cxx11::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-440560-big-Data.db]), retrying...
Aug 11 16:45:30 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for scylla_bench_1.test1, 48954 candidates were found
Aug 11 16:45:47 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] compaction_manager - off-strategy compaction failed due to std::filesystem::__cxx11::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-440680-big-Data.db]), retrying...
Aug 11 16:45:57 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for scylla_bench_1.test1, 48954 candidates were found
Aug 11 16:46:13 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] compaction_manager - off-strategy compaction failed due to std::filesystem::__cxx11::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/0000000000440816.sstable/md-440816-big-Data.db]), retrying...
Aug 11 16:46:33 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for scylla_bench_1.test1, 48954 candidates were found
Aug 11 16:46:40 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] compaction_manager - off-strategy compaction failed due to std::filesystem::__cxx11::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-440856-big-Data.db]), retrying...
Aug 11 16:47:20 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for scylla_bench_1.test1, 48954 candidates were found
Aug 11 16:47:25 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] compaction_manager - off-strategy compaction failed due to std::filesystem::__cxx11::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/0000000000440880.sstable/md-440880-big-Data.db]), retrying...
Aug 11 16:48:45 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for scylla_bench_1.test1, 48954 candidates were found
Aug 11 16:49:09 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] compaction_manager - off-strategy compaction failed due to std::filesystem::__cxx11::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/0000000000441072.sstable/md-441072-big-TOC.txt.tmp]), retrying...
Aug 11 16:51:49 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Starting off-strategy compaction for scylla_bench_1.test1, 48954 candidates were found
Aug 11 16:53:38 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[138241]:  [shard 0] table - Done with off-strategy compaction for scylla_bench_1.test1

Off-strategy compaction even failed due to the high number of sstables.

Table scylla_bench_1.test1, which had ~49k files, uses a 1-minute window.

With TWCS, there will be one file per vnode range per time-window bucket.

That's why we have ~728 (256 * RF=3) files being reshaped at once; they all belong to the same window. They don't overlap, so they're all reshaped at once using the incremental selector, to keep memory usage bounded. @asias I assumed that after asias@da04819, the number of sstables created by a repair would be #_of_vnode_ranges at most.

For the 49k files, that gives ~49k / 728, so about 67 windows were repaired.

If we kept the number of windows as low as 10, the number of sstables would be reduced considerably, but it would still be in the thousands.
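To put rough numbers on this (an illustrative calculation using the figures above, not anything Scylla computes), the worst case is one maintenance sstable per vnode range, per replica, per time window:

#include <cstdio>

// Worst-case maintenance sstables produced by repair on a TWCS table:
// one sstable per vnode range, per replica, per time window.
// 256 vnodes and RF=3 match this cluster; the window count varies.
int main() {
    const long vnodes = 256;
    const long rf = 3;
    const long window_counts[] = {1, 10, 67, 92};
    for (long windows : window_counts) {
        std::printf("%3ld windows -> ~%ld maintenance sstables\n",
                    windows, vnodes * rf * windows);
    }
    return 0;
}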

@raphaelsc
Member

@avikivity we should probably limit the concurrency at which sstables are destroyed, to avoid similar stalls (which result from unbounded memory usage):

void seastar::future<void>::schedule<seastar::internal::promise_base_with_type<void>, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(auto:1&&)#1}>({lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}>(sstables::sstable::close_files()::$_68&&, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(auto:1&&)#1}>({lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(auto:1&&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}&&) at ././seastar/include/seastar/core/future.hh:1381
 (inlined by) seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}&&) at ././seastar/include/seastar/core/future.hh:1666
 (inlined by) seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_common<false, seastar::future<void>, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}&&) at ././seastar/include/seastar/core/future.hh:1691
 (inlined by) seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_maybe_erase<false, seastar::future<void>, seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}>(seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}&&) at ././seastar/include/seastar/core/future.hh:1644
 (inlined by) seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped<seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&)::{lambda(auto:1&&)#1}, seastar::future<void> >(sstables::sstable::close_files()::$_68&&) & at ././seastar/include/seastar/core/future.hh:1628
 (inlined by) seastar::future<void> seastar::future<void>::handle_exception<sstables::sstable::close_files()::$_68>(sstables::sstable::close_files()::$_68&&) at ././seastar/include/seastar/core/future.hh:1831
 (inlined by) sstables::sstable::close_files() at ./sstables/sstables.cc:2367
sstables::sstable::destroy() at ./sstables/sstables.cc:3032
sstables::sstables_manager::deactivate(sstables::sstable*) at ./sstables/sstables_manager.cc:82
seastar::internal::lw_shared_ptr_accessors<sstables::sstable, void>::dispose(seastar::lw_shared_ptr_counter_base*) at ././seastar/include/seastar/core/shared_ptr.hh:244
 (inlined by) ~lw_shared_ptr at ././seastar/include/seastar/core/shared_ptr.hh:301

This can potentially happen whenever thousands of sstables are destroyed at once. Destroying an sstable is not trivial anymore, as its index cache has to be evicted, which is another reason to limit concurrency.
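A minimal standalone sketch of that idea (illustrative only, not Scylla code; release_gently() and reactor_yield() are hypothetical names, with reactor_yield() standing in for a Seastar yield point): instead of letting a vector destructor drop tens of thousands of sstable references in one synchronous run, drop them in small batches and yield between batches.

#include <cstdio>
#include <memory>
#include <vector>

struct sstable {
    ~sstable() { /* close files, evict index cache, ... */ }
};

// Hypothetical stand-in for handing control back to the Seastar reactor.
void reactor_yield() { std::puts("yield to reactor"); }

// Release a large set of sstables in bounded batches rather than all at once,
// so no single task destroys tens of thousands of objects synchronously.
void release_gently(std::vector<std::shared_ptr<sstable>> ssts, size_t batch = 128) {
    size_t released = 0;
    for (auto& s : ssts) {
        s.reset();               // drop this reference; may destroy the sstable
        if (++released % batch == 0) {
            reactor_yield();     // bound the work done per reactor task
        }
    }
}

int main() {
    std::vector<std::shared_ptr<sstable>> ssts(1000);
    for (auto& s : ssts) {
        s = std::make_shared<sstable>();
    }
    release_gently(std::move(ssts));
    return 0;
}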

@raphaelsc
Member

I will patch add_sstable() to avoid the stalls, similar to how stalls were avoided on compaction completion after f7b731b.

On every add_sstable(), the sstable set is cloned, so we can introduce a clone_gently() which yields if needed.
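A rough standalone sketch of what a clone_gently() could look like (illustrative only; need_preempt() and yield() are stand-ins for Seastar's preemption check and yield point, and the map is a stand-in for the per-run map being copied in the trace above): copy the set incrementally and hand back control whenever the copy has run long enough.

#include <cstdio>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for Seastar's preemption check and yield point.
static int work_since_yield = 0;
bool need_preempt() { return ++work_since_yield >= 1024; }
void yield() { work_since_yield = 0; std::puts("yield"); }

struct sstable_run {};

// Illustrative clone_gently(): copy a large map entry by entry, yielding when
// the preemption check fires, instead of copying it in one go (which is what
// the plain copy constructor does, and what stalls the reactor).
std::unordered_map<std::string, sstable_run>
clone_gently(const std::unordered_map<std::string, sstable_run>& src) {
    std::unordered_map<std::string, sstable_run> dst;
    dst.reserve(src.size());
    for (const auto& [id, run] : src) {
        dst.emplace(id, run);
        if (need_preempt()) {
            yield();
        }
    }
    return dst;
}

int main() {
    std::unordered_map<std::string, sstable_run> runs;
    for (int i = 0; i < 5000; ++i) {
        runs.emplace("run-" + std::to_string(i), sstable_run{});
    }
    auto copy = clone_gently(runs);
    std::printf("cloned %zu runs\n", copy.size());
    return 0;
}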

@raphaelsc
Member

@aleksbykov it's expected that scrub will continue in the background after the command is terminated with Ctrl+C. AFAICT, the termination signal will not be handled by the command, and it will not request that Scylla stop the background operation. Please rename the issue to "Offstrategy compaction on TWCS tables triggers a lot of reactor stalls".

@raphaelsc raphaelsc self-assigned this Aug 16, 2021
@aleksbykov aleksbykov changed the title from "Scrub continue to run in background after nodetool command was terminated and trigger a lot of reactor stalls" to "Offstrategy compaction on TWCS tables triggers a lot of reactor stalls" Aug 17, 2021
@aleksbykov
Contributor Author

The issue has been renamed.

@avikivity
Member

How did we get 50k sstables per shard?

@raphaelsc
Member

How did we get 50k sstables per shard?

With repair, there's one sstable per vnode range per RF (I'd like to understand this better, @asias; after asias@da04819 I thought it would be only one per vnode range). With TWCS, each sstable is further split into up to N sstables, where N == the max # of windows. So assuming RF=3, the worst case is (256 * RF) * #_of_windows. Digging through the log, I found there were ~92 windows (when the ~50k maintenance sstables were created), so (256*3) * 92 gives us a worst case of ~70k sstables.

92 windows is unreasonable, but Scylla should still behave better. I suggested a couple of improvements above (#9199 (comment) and #9199 (comment)).

Alternatively, to avoid this explosion in the # of sstables, we could perhaps not segregate maintenance sstables (by timestamp) and serve them using a partitioned set within the maintenance set (which is reasonable because they don't overlap; on the other hand, time-based queries will have poorer performance). The maintenance set would then contain only 1 sstable per vnode range. Reshape would then pick those sstables all at once, as they don't overlap, and produce 1 sstable per time window for integration into the main set.

@avikivity
Member

Perhaps we shouldn't create one sstable per window when we send sstables over for repair, since we'll be splitting them anyway as part of off-strategy compaction.

@raphaelsc
Member

Perhaps we shouldn't create one sstable per window when we send sstables over for repair, since we'll be splitting them anyway as part of off-strategy compaction.

We do the splitting on the receiver. When writing a new sstable for a range A, segregation is applied, which means N sstables are created, where N = # of windows.

So today, with segregation before off-strategy, there is one sstable per range per window in the maintenance set, so a query for partition K will touch N sstables, where N = # of windows.

If we don't segregate before off-strategy, there will be only one sstable per range in the maintenance set, so a query for partition K will touch only 1 sstable. It looks like we can definitely postpone segregation to off-strategy.
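A toy sketch of the segregation step being described here (not Scylla's actual code): rows streamed for one token range are bucketed by their TWCS time window, so one input range produces one output per window; skipping this step before off-strategy would instead leave a single output per range.

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

struct row {
    int64_t write_time_ms;
};

// Bucket the rows streamed for one token range by TWCS time window.
// With pre-segregation, each bucket becomes its own maintenance sstable
// (one sstable per range per window); without it, the whole range would land
// in a single maintenance sstable and be segregated later during reshape.
std::map<int64_t, std::vector<row>>
segregate_by_window(const std::vector<row>& range_input, int64_t window_ms) {
    std::map<int64_t, std::vector<row>> windows;
    for (const row& r : range_input) {
        windows[r.write_time_ms / window_ms].push_back(r);
    }
    return windows;
}

int main() {
    const int64_t window_ms = 5 * 60 * 1000;   // 5-minute windows, as in the c-s profile
    std::vector<row> input;
    for (int64_t t = 0; t < 90 * 60 * 1000; t += 30 * 1000) {
        input.push_back({t});                  // writes spread over 90 minutes
    }
    auto windows = segregate_by_window(input, window_ms);
    std::printf("1 input range -> %zu per-window outputs\n", windows.size());
    return 0;
}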

@avikivity
Member

@raphaelsc is 4.5 vulnerable to this? I think not, since make_streaming_producer is new, but maybe it's the result of refactoring and the pre-refactoring code is vulnerable in the same way.

@avikivity
Member

The code was refactored in 0944d69.

@avikivity
Member

I think it is vulnerable.

@raphaelsc
Member

The code was refactored in 0944d69.

In 4.5, off-strategy isn't wired up to streaming-based operations. Repair-based operations are disabled by default, and regular repair doesn't go through off-strategy yet, so I think we're actually not vulnerable.

@raphaelsc
Member

Assuming someone enables repair-based operations, we could consider it vulnerable. But it's experimental, right? What do you think?

@raphaelsc
Member

The patch isn't very risky; please let me know if you want me to backport it.

@avikivity
Member

Let's focus on the defaults. If there's bad performance in non-default configuration, we can let it remain and encourage users to upgrade instead.

@avikivity
Member

Fix present on all active branches, not backporting.

@bhalevy
Member

bhalevy commented Jan 18, 2024

It looks like a4053db is causing 100% space amplification in TWCS offstrategy compaction since it has to read all the input sstables in the maintenance set to segregate them into windows and it can't retire any of them until offstrategy compaction is done.

Cc @raphaelsc @vladzcloudius
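As a rough illustration of the 100% figure (an assumption for the sake of the example: the reshaped output is about the same size as the input, since segregating into windows doesn't merge data away): the inputs can only be retired once the whole off-strategy compaction finishes, so peak disk usage is roughly input plus output.

#include <cstdio>

// Rough peak-disk estimate for off-strategy compaction of a TWCS maintenance set.
// The table size below is a made-up example; the point is that input sstables
// are retired only after the whole compaction finishes, so peak ~ input + output.
int main() {
    const double table_gb = 200.0;        // hypothetical table size
    const double output_gb = table_gb;    // assumption: reshape output ~ input size
    std::printf("peak disk ~ %.0f GB input + %.0f GB output = %.0f GB (~100%% extra)\n",
                table_gb, output_gb, table_gb + output_gb);
    return 0;
}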

@vladzcloudius
Contributor

I also wonder what was the reason we enabled off-strategy with TWCS to begin with?
Because that was the decision that caused this and other related issues, like #16514

@raphaelsc
Member

It looks like a4053db is causing 100% space amplification in TWCS offstrategy compaction since it has to read all the input sstables in the maintenance set to segregate them into windows and it can't retire any of them until offstrategy compaction is done.

Cc @raphaelsc @vladzcloudius

We need that patch, though. Streaming is token-oriented. You cannot wait for 756 sstables to accumulate in windows in the maintenance set, or you suffer read issues (actually, for bootstrap and replace we can). And if you don't wait for it, you reshape the input in small batches, but that's almost what we did before off-strategy came into existence.

@raphaelsc
Member

I also wonder what was the reason we enabled off-strategy with TWCS to begin with?
Because that was the decision that caused this and other related issues, like #16514

Write amplification is the major reason for off-strategy. If you have space, great. If you don't, you are doomed.

@raphaelsc
Member

I also wonder what was the reason we enabled off-strategy with TWCS to begin with?
Because that was the decision that caused this and other related issues, like #16514

Write amplification is the major reason for off-strategy. If you have space, great. If you don't, you are doomed.

I sent the patch to optionally disable off-strategy for TWCS, and the long-term plan is to bring incremental compaction to TWCS.

@vladzcloudius
Contributor

vladzcloudius commented Jan 18, 2024

Write amplification is the major reason for off-strategy. If you have space, great. If you don't, you are doomed.

Somehow the outcome is exactly the opposite, @raphaelsc.
Off-strategy requires 100% extra disk space, whereas with the original behavior (without off-strategy) streamed data is segregated and compacted with a high min_threshold immediately.

The only reason we "need" off-strategy is to make the compaction of streamed data more efficient, since the sstables are disjoint. Other than that I see zero benefit in off-strategy.

And since we already had the "increased min_threshold" trick for compacting streamed data, it's a big question how much off-strategy gives you on top of that.

@raphaelsc
Member

Write amplification is the major reason for off-strategy. If you have space, great. If you don't, you are doomed.

Somehow the outcome is exactly the opposite, @raphaelsc.
Off-strategy requires 100% extra disk space, whereas with the original behavior (without off-strategy) streamed data is segregated and compacted with a high min_threshold immediately.

The focus is reducing write amplification. With ICS, the space cost is low. With TWCS, the cost is proportional to the size of the table, so the sky is the limit.

The only reason we "need" off-strategy is to make the compaction of streamed data more efficient, since the sstables are disjoint. Other than that I see zero benefit in off-strategy.

Yes, write efficiency, and therefore a reduction in node-operation time.

And since we already had the "increased min_threshold" trick for compacting streamed data, it's a big question how much off-strategy gives you on top of that.

Off-strategy reduces write amplification from 3 (2 rewrites) to 2 (1 rewrite). With large data sets that can be significant.

@avikivity
Member

Why 3 to 2?

@raphaelsc
Member

Users that don't have big TWCS tables might face a regression in efficiency, so I think it's a good direction to leave them as they are. The patch I sent provides a way to opt out, and in the long run we will fix the TWCS space overhead with incremental compaction.

@raphaelsc
Member

Why 3 to 2?

With off-strategy we reshape all the unsegregated input at once. Without it, we fix the input in batches of 16 (the temporary threshold), yielding another 48 sstables that are compacted in another round. This assumes 768 sstables produced (256 ranges, RF=3).
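For illustration, the two paths under the assumptions in this comment (768 streamed sstables, interim batches of 16), as a small calculation:

#include <cstdio>

// Rewrite counts for integrating streamed sstables, under the assumptions in
// the comment above: 768 streamed sstables, interim compactions in batches of 16.
int main() {
    const int streamed = 768;                    // 256 vnode ranges * RF=3
    const int batch = 16;                        // interim compaction threshold
    const int intermediates = streamed / batch;  // 48

    // Without off-strategy: batches of 16 produce 48 intermediate sstables,
    // which are then compacted again -> 2 rewrites after streaming.
    std::printf("without off-strategy: %d -> %d -> final (2 rewrites)\n",
                streamed, intermediates);

    // With off-strategy: all 768 disjoint sstables are reshaped in one pass
    // -> 1 rewrite after streaming.
    std::printf("with off-strategy:    %d -> final (1 rewrite)\n", streamed);
    return 0;
}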

@raphaelsc
Member

Why 3 to 2?

With off-strategy we reshape all the unsegregated input at once. Without it, we fix the input in batches of 16 (the temporary threshold), yielding another 48 sstables that are compacted in another round. This assumes 768 sstables produced (256 ranges, RF=3).

s/fix/compact/
