Offstrategy compaction on TWCS tables trigger a lot of reactor stalls #9199
I think it's related. However, the workload ran for 120 minutes. 120/5 = 24, which is not a huge number and should not cause a stall.
$ grep Reshaped */system.log | sort -n -k 14 | tail -n 10
longevity-twcs-3h-master-db-node-8d71d74e-1/system.log:Aug 11 08:59:43 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[9880]: [shard 3] compaction - [Reshape scylla_bench_1.test1 767120b0-fa82-11eb-afa2-37957b9762d6] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35195-big-Data.db:level=0]. 14MB to 9MB (~61% of original) in 3484ms = 2MB/s. ~202496 total partitions merged to 154260.
longevity-twcs-3h-master-db-node-8d71d74e-1/system.log:Aug 11 08:59:53 longevity-twcs-3h-master-db-node-8d71d74e-1 scylla[9880]: [shard 3] compaction - [Reshape scylla_bench_1.test1 788867f0-fa82-11eb-afa2-37957b9762d6] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35203-big-Data.db:level=0]. 414MB to 12MB (~3% of original) in 9743ms = 1MB/s. ~203904 total partitions merged to 155948.
longevity-twcs-3h-master-db-node-8d71d74e-3/system.log:Aug 11 08:59:42 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]: [shard 2] compaction - [Reshape scylla_bench_1.test1 76736aa0-fa82-11eb-a4c3-152fdfe09b31] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-17666-big-Data.db:level=0]. 28MB to 3MB (~11% of original) in 2091ms = 1MB/s. ~106240 total partitions merged to 47515.
longevity-twcs-3h-master-db-node-8d71d74e-3/system.log:Aug 11 08:59:42 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]: [shard 3] compaction - [Reshape scylla_bench_1.test1 767406e0-fa82-11eb-b3a3-152ddfe09b31] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35299-big-Data.db:level=0]. 14MB to 9MB (~61% of original) in 1947ms = 4MB/s. ~202496 total partitions merged to 154260.
longevity-twcs-3h-master-db-node-8d71d74e-3/system.log:Aug 11 08:59:48 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]: [shard 3] compaction - [Reshape scylla_bench_1.test1 77a16350-fa82-11eb-b3a3-152ddfe09b31] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35307-big-Data.db:level=0]. 414MB to 12MB (~3% of original) in 6479ms = 1MB/s. ~203904 total partitions merged to 155948.
longevity-twcs-3h-master-db-node-8d71d74e-4/system.log:Aug 11 08:59:45 longevity-twcs-3h-master-db-node-8d71d74e-4 scylla[758]: [shard 2] compaction - [Reshape scylla_bench_1.test1 76720b10-fa82-11eb-8d23-ef699da595fa] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-17602-big-Data.db:level=0]. 128MB to 3MB (~3% of original) in 4770ms = 816kB/s. ~106240 total partitions merged to 47515.
longevity-twcs-3h-master-db-node-8d71d74e-4/system.log:Aug 11 08:59:45 longevity-twcs-3h-master-db-node-8d71d74e-4 scylla[758]: [shard 3] compaction - [Reshape scylla_bench_1.test1 76725930-fa82-11eb-9374-ef6c9da595fa] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35155-big-Data.db:level=0]. 410MB to 12MB (~3% of original) in 5051ms = 2MB/s. ~202496 total partitions merged to 154260.
longevity-twcs-3h-master-db-node-8d71d74e-4/system.log:Aug 11 08:59:48 longevity-twcs-3h-master-db-node-8d71d74e-4 scylla[758]: [shard 3] compaction - [Reshape scylla_bench_1.test1 7978bb60-fa82-11eb-9374-ef6c9da595fa] Reshaped 728 sstables to [/var/lib/scylla/data/scylla_bench_1/test1-4584da90fa7e11eb86f037997b9762d6/md-35163-big-Data.db:level=0]. 15MB to 9MB (~61% of original) in 3098ms = 3MB/s. ~203904 total partitions merged to 155948.
longevity-twcs-3h-master-db-node-8d71d74e-3/system.log:Aug 11 14:36:32 longevity-twcs-3h-master-db-node-8d71d74e-3 scylla[746]: [shard 2] compaction - [Reshape keyspace_twcs.twcs_cf 7f64f410-fab1-11eb-a4c3-152fdfe09b31] Reshaped 729 sstables to [/var/lib/scylla/data/keyspace_twcs/twcs_cf-2bea7b70fa7f11eb8addef6b9da595fa/md-89546-big-Data.db:level=0]. 339MB to 18MB (~5% of original) in 10785ms = 1MB/s. ~266496 total partitions merged to 217681.
longevity-twcs-3h-master-db-node-8d71d74e-4/system.log:Aug 11 14:36:23 longevity-twcs-3h-master-db-node-8d71d74e-4 scylla[758]: [shard 2] compaction - [Reshape keyspace_twcs.twcs_cf 7a437a10-fab1-11eb-8d23-ef699da595fa] Reshaped 729 sstables to [/var/lib/scylla/data/keyspace_twcs/twcs_cf-2bea7b70fa7f11eb8addef6b9da595fa/md-33610-big-Data.db:level=0]. 375MB to 16MB (~4% of original) in 10422ms = 1MB/s. ~247936 total partitions merged to 199159.
So at most 729 sstables were reshaped. That shouldn't cause such huge stalls.
/cc @raphaelsc
Up to 50k files per shard, which explains the stalls, see:
Offstrategy even failed due to the high number of sstables. Table scylla_bench_1.test1, which had ~49k files, uses a 1-minute window. With TWCS, there will be one file per vnode range per time window bucket. That's why we have ~728 (256 * RF=3) files being reshaped at once; those belong to the same window. They don't overlap, so they're all reshaped at once using the incremental selector, to keep memory usage bounded. @asias I assumed that after asias@da04819, the number of sstables created by a repair would be #_of_vnode_ranges at most. For 49k files, that's ~49k / 728, so about 67 windows were repaired. If we keep the number of windows as low as 10, the number of sstables would be reduced considerably, but would still be in the thousands.
@avikivity we should probably limit the concurrency at which sstables are destroyed, to avoid similar stalls (resulting from unbounded memory usage):
This can potentially happen whenever thousands of sstables are destroyed at once. Destroying an sstable is not trivial anymore, as its index cache has to be evicted, which is another reason to limit concurrency.
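As an illustration of the bounded-concurrency idea (a sketch only; the type `sstable` and the helpers `destroy_sstable()`/`destroy_sstables()` are placeholders, not Scylla code), Seastar's max_concurrent_for_each can cap how many teardowns are in flight at once:

```cpp
#include <cstddef>
#include <memory>
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>

struct sstable {};                               // placeholder for the real sstable type
using sstable_ptr = std::shared_ptr<sstable>;

// Placeholder async teardown; the real one also evicts index-cache entries, etc.
seastar::future<> destroy_sstable(sstable_ptr sst) {
    sst.reset();
    return seastar::make_ready_future<>();
}

// Destroy a batch of sstables with at most `max_concurrent` teardowns in flight,
// instead of kicking off thousands of them at once.
seastar::future<> destroy_sstables(std::vector<sstable_ptr> ssts) {
    constexpr std::size_t max_concurrent = 32;   // arbitrary bound for the sketch
    co_await seastar::max_concurrent_for_each(ssts.begin(), ssts.end(), max_concurrent,
            [] (sstable_ptr& sst) {
        return destroy_sstable(std::move(sst));
    });
}
```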
Will patch add_sstable() to avoid stalls, similar to how stalls are avoided on compaction completion after f7b731b. On every add_sstable(), the sstable set is cloned, so we can introduce a clone_gently() which yields if needed.
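A minimal sketch of what a clone_gently() could look like, assuming a simplified stand-in for the sstable set (the real sstable_set in Scylla is more involved); the point is the coroutine::maybe_yield() between elements:

```cpp
#include <memory>
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/coroutine/maybe_yield.hh>

struct sstable {};                                   // placeholder type
using sstable_ptr = std::shared_ptr<sstable>;
using sstable_set_snapshot = std::vector<sstable_ptr>;  // stand-in for the real set

seastar::future<sstable_set_snapshot> clone_gently(const sstable_set_snapshot& set) {
    sstable_set_snapshot copy;
    copy.reserve(set.size());
    for (const auto& sst : set) {
        copy.push_back(sst);
        // Cheap no-op while the task quota isn't exhausted; otherwise yields to
        // the reactor, so cloning tens of thousands of entries no longer stalls.
        co_await seastar::coroutine::maybe_yield();
    }
    co_return copy;
}
```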
@aleksbykov it's expected that scrub will continue in the background after the command is terminated with CTRL-C. AFAICT, the termination signal is not handled by the command, so it does not request scylla to stop the background operation. Please rename the issue to "Offstrategy compaction on TWCS tables trigger a lot of reactor stalls".
Issue was renamed.
How did we get 50k sstables per shard?
With repair, there's one sstable per vnode range per RF (I'd like to understand this one better @asias; after asias@da04819 I thought it would be only one per vnode range). With TWCS, each sstable is further split into up to N sstables, where N == max # of windows. So assuming RF=3, the worst case is (256 * RF) * #_of_windows. Digging through the log, I figured out that there were ~92 windows (when ~50k maintenance sstables were created), so (256*3) * 92 gives us a worst case of ~70k sstables. 92 windows is unreasonable, but scylla should behave better. I suggested a couple of improvements above (#9199 (comment) and #9199 (comment)). Alternatively, to avoid this boom in the number of sstables, we could perhaps not segregate (by timestamp) maintenance sstables and serve them using a partitioned set in the maintenance set (which is reasonable because they don't overlap, but on the other hand, time-based queries will have poorer performance). The maintenance set would then contain only 1 sstable per vnode range instead. Reshape will then pick those sstables at once, as they don't overlap, and produce 1 sstable per time window for integration into the main set.
Perhaps we shouldn't create one sstable per window when we send sstables over for repair, since we'll be splitting them anyway as part of offstrategy.
We do the splitting on the receiver. When writing a new sstable for a range A, segregation is applied, which means N sstables are created, where N = # of windows. So today, with segregation before offstrategy, there is one sstable per range per window in the maintenance set, so a query for partition K will touch N sstables, where N = # of windows. If we don't segregate before offstrategy, there will be only one sstable per range in the maintenance set, so a query for partition K will touch only 1 sstable. Looks like we can definitely postpone segregation to offstrategy.
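To make the segregation step concrete, here is a simplified sketch (not Scylla's actual TWCS writer; names are illustrative) of how a write timestamp maps to a time-window bucket; rows in different buckets go to different output sstables, which is where the N = #_of_windows fan-out comes from:

```cpp
#include <chrono>
#include <cstdint>

using timestamp_us = int64_t;   // microseconds since epoch, like CQL write timestamps

// Window id for a given timestamp and window size (e.g. 1 minute); rows whose
// timestamps map to the same id end up in the same output sstable.
int64_t twcs_window_for(timestamp_us ts, std::chrono::microseconds window_size) {
    return ts / window_size.count();
}
```

With a 1-minute window and roughly 92 windows' worth of timestamps, a single streamed range fans out into up to 92 files, hence the (256 * RF) * #_of_windows blow-up discussed above.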
@raphaelsc is 4.5 vulnerable to this? I think not, since make_streaming_producer is new, but maybe it's the result of refactoring and the pre-refactoring code is vulnerable in the same way.
The code was refactored in 0944d69.
I think it is vulnerable.
In 4.5, off-strategy isn't wired to streaming-based ops. Repair-based ops are disabled by default, and regular repair isn't off-strategy'd yet. So I think we're actually not vulnerable.
Assuming someone enables repair-based ops, then we could consider it vulnerable. But it's experimental, right? What do you think?
The patch isn't very risky; please let me know if you want me to backport it.
Let's focus on the defaults. If there's bad performance in a non-default configuration, we can let it remain and encourage users to upgrade instead.
Fix present on all active branches, not backporting.
It looks like a4053db is causing 100% space amplification in TWCS offstrategy compaction, since it has to read all the input sstables in the maintenance set to segregate them into windows, and it can't retire any of them until offstrategy compaction is done.
I also wonder what the reason was for enabling off-strategy with TWCS to begin with.
We need that patch though. Streaming is token-oriented. You cannot wait for 756 sstables to accumulate in windows in the maintenance set, or you suffer read issues (actually for bootstrap and replace we can). And if you don't wait for it, you reshape input in small batches, but that's almost what we did before off-strategy came into existence.
Write amplification is the major reason for off-strategy. If you have space, great. If you don't, you are doomed.
I sent the patch to optionally disable off-strategy for TWCS, and the long-term plan is pushing incremental compaction to TWCS.
Somehow the outcome is exactly the opposite, @raphaelsc. The only reason we "need" off-strategy is to make the compaction of streamed data more efficient, since sstables are disjoint. Other than that I see zero benefit in off-strategy. And since we already had an "increased min_threshold" trick for compacting streamed data, it's a big question how much off-strategy gives you on top of that.
The focus is reducing write amplification. With ICS, the space cost is low. With TWCS, the cost is proportional to the size of the table, so the sky is the limit.
Yes, write efficiency, and therefore reducing the node operation time.
Off-strategy reduces write amplification from 3 (2 rewrites) to 2 (1 rewrite). With large data sets that can be significant.
Why 3 to 2?
Users that don't have big TWCS tables might face a regression in efficiency, so I think it's a good direction to leave them like that. The patch I sent provides a way to opt out, and in the long run we fix the TWCS space overhead with incremental compaction.
With off-strategy we reshape all unsegregated input at once. Without it we compact input in batches of 16 (temp threshold), yielding another 48 that are compacted in another round. Assume 768 sstables produced (256 ranges, RF 3).
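A back-of-the-envelope check of the 3-vs-2 claim, using the numbers from this thread (768 streamed sstables, batches of 16); purely illustrative, not measured:

```cpp
#include <cstdio>

int main() {
    int sstables = 768, batch = 16, rewrites = 0;
    while (sstables > batch) {               // each round rewrites all remaining data once
        sstables = (sstables + batch - 1) / batch;   // 768 -> 48 -> 3
        ++rewrites;
    }
    // Without off-strategy: initial write + 2 compaction rewrites => amplification 3.
    std::printf("without off-strategy: %d rewrites, write amplification %d\n", rewrites, 1 + rewrites);
    // With off-strategy, the whole input is reshaped in one pass: 1 rewrite => amplification 2.
    std::printf("with off-strategy: 1 rewrite, write amplification 2\n");
    return 0;
}
```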
Installation details
Scylla version (or git commit hash): Scylla version 4.6.dev-0.20210728.42e1f318d with build-id d23dd7c84a396ba1cdf9a93e82a05cb944e5c234
Cluster size: 3 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0a67e737c28ecaa63 (eu-north-1)
instance type: i3.2xlarge
I used 2 stress tools to load the cluster with time-series data:
scylla-bench:
and cassandra-stress command:
c-s used the following profile:
The following schema was created:
During the load, I stopped scylla-server on node3, removed several sstables, and started Scylla again. After Scylla was up and nodetool status returned that all nodes have status UN, I ran
nodetool scrub
and terminated it with Ctrl+C. Right after that I ran nodetool repair.
But in the logs of the nodes (node1 and node3), within several minutes I got reactor stalls of up to 130 ms:
The scrub process continued to run in the background, in parallel with the repair.
I decoded several reactor stalls:
Current nodetool status:
All nodes are available
Db logs:
https://cloudius-jenkins-test.s3.amazonaws.com/8d71d74e-55cc-4c51-8f39-8ccfdae88a7d/20210812_082856/db-cluster-8d71d74e.tar.gz