Cleanup compaction / rewrite_sstables blocks behind regular compaction for each sstable #10175
Comments
This is hard to reproduce in dtest, since we need large sstables and probably a heavy write load to trigger long compactions in parallel with cleanup (or upgrade sstables, for that matter). I was able to reproduce the cleanup slowdown in simulation by injecting a 1-second sleep into each compaction job.
Also, I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader) and makes it per-keyspace, so we can run one (sharded) major or off-strategy compaction per keyspace at a time, versus one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace, since the latter, which use rewrite_sstables, compact only a single sstable at a time.
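As a rough sketch of that per-keyspace serialization (std-C++ only; the class and function names below are made up for illustration, while the real patch works with Seastar semaphores in the api / distributed_loader layer), the idea is simply that the unit of serialization becomes the keyspace rather than the whole node:

```cpp
// Minimal std-C++ sketch: maintenance ops (cleanup, upgrade_sstables, scrub, ...)
// on the same keyspace are serialized, while different keyspaces can proceed
// in parallel. Names are hypothetical and only illustrate the idea.
#include <functional>
#include <map>
#include <mutex>
#include <string>

class keyspace_ops {
    std::mutex _map_lock;
    std::map<std::string, std::mutex> _per_keyspace;  // one lock per keyspace

    std::mutex& lock_for(const std::string& ks) {
        std::lock_guard g(_map_lock);
        return _per_keyspace[ks];  // default-constructed on first use
    }

public:
    void run_maintenance_op(const std::string& ks, const std::function<void()>& op) {
        std::lock_guard g(lock_for(ks));  // serializes ops within this keyspace only
        op();
    }
};
```

With that shape, a long cleanup on one keyspace never stalls maintenance work on another, which is the behavior the patch is after.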
…eyspace Maintenance operations need not be serialized globally. There is no reason not to run e.g. cleanup or reshape on multiple keyspaces and even multiple tables within each keyspace in parallel, so let the api layer serialize maintenance operations per keyspace and schedule the compaction_manager tasks in the keyspace. Refs scylladb#10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…jor and regular compaction Maintenance operations like cleanup, upgrade, reshape, and reshard are serialized with major compaction using `serialize_maintenance_operation` and the per-keyspace _maintenance_ops semaphore. Fixes scylladb#10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…roup With scylladb#10175 fixed, rewrite_sstables can run compaction in maintenance scheduling group again, reinstating 4c05e5f and reverting a9427f1 that reverted it. Refs scylladb#10060 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This change effectively reverts 40a6049, which is no longer required after 6737c88, where the _maintenance_ops_sem is used to serialize tasks that use rewrite_sstables, like cleanup, upgrade, reshape, or scrub, with major compaction. Since regular compaction may run in parallel, no lock is required per-table. This solves the slowness in cleanup compaction (and likely in upgrade sstables seen in scylladb#10060, and possibly scylladb#10166). Fixes scylladb#10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Also refs #10097, which focuses on adding the output sstables in batches to reduce write amplification and increase the efficiency of the regular compaction that runs in parallel with cleanup / upgrade sstables.
I think this could be a regression. Maintenance operations cannot be made faster by running more of them. One cleanup per shard is enough to saturate its available bandwidth, as compaction takes advantage of read-ahead and write-behind for disk utilization.
Thanks for working on this, Benny! It definitely makes sense to make rewrite_sstables() acquire the read lock instead.
Yes, that's definitely asking for more trouble, as we could have cleanups of different tables running intermixed. It's better to allow one at a time.
Why do we run parallel compaction jobs on multiple buckets at a time, then? If e.g. scrub is running on one keyspace or table, why not allow major compaction to run on another keyspace? Or if we run resharding for …
I'm not sure this is a problem.
Since regular compaction may run in parallel, no lock is required per-table. We still acquire a read lock in this patch, for backporting purposes, in case the branch doesn't contain 6737c88. But it can be removed entirely in master in a follow-up patch. This should solve some of the slowness in cleanup compaction (and likely in upgrade sstables seen in scylladb#10060, and possibly scylladb#10166). Fixes scylladb#10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We do allow parallel compaction jobs on multiple buckets, but for reducing read amplification, never for increasing compaction efficiency. In the past, the compaction manager was very keen on running more jobs in parallel, but that was a terrible mistake, as efficiency is a function of informed decisions, not parallelism. For example, if we serialized compaction on a table, and we're compacting 64G of data at 1GB/s while flushing new files at the same rate, by the time the compaction completes we'd have generated 64 files in level 0. We can of course allow maintenance operations to go in parallel; we just have to keep in mind that we're not doing so to increase efficiency.
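Spelling out the arithmetic in that example (the roughly-1 GB-per-flushed-file figure is my assumption, not stated above): 64 GB / 1 GB/s = 64 s of compaction, and flushing at the same 1 GB/s for those 64 s writes another 64 GB, i.e. about 64 new files accumulating in level 0 before the compaction finishes.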
I don't see any good reason to run cleanup for multiple tables at a time. But I can see very good reasons for running cleanup for a single table at a time: resiliency. We can start with the smallest table to increase the chances of success later on. The implementation of #10097 also becomes more efficient, as the cleanup for table T will not be fighting for resources with the cleanup for another table, so the proposed batching can be done faster.
Should be backported to 4.6 based on #10060 (comment)
Since regular compaction may run in parallel, no lock is required per-table. We still acquire a read lock in this patch, for backporting purposes, in case the branch doesn't contain 6737c88. But it can be removed entirely in master in a follow-up patch. This should solve some of the slowness in cleanup compaction (and likely in upgrade sstables seen in #10060, and possibly #10166). Fixes #10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10177 (cherry picked from commit 11ea2ff)
@bhalevy please prepare a PR with the backport to 4.6; it doesn't apply cleanly there (it applies cleanly to 5.0).
Since regular compaction may run in parallel, no lock is required per-table. We still acquire a read lock in this patch, for backporting purposes, in case the branch doesn't contain 6737c88. But it can be removed entirely in master in a follow-up patch. This should solve some of the slowness in cleanup compaction (and likely in upgrade sstables seen in #10060, and possibly #10166). Fixes #10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10177 (cherry picked from commit 11ea2ff) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220314151416.2496374-1-bhalevy@scylladb.com>
Backported to all vulnerable branches (4.6, 5.0); removing the "Backport candidate" label.
Opening this issue based on my analysis in #10060 (comment)
With 40a6049, rewrite_sstables takes the write lock for each sstable.
Since it will block behind regular compaction, the whole cleanup task might take a very long time.
The purpose of this change was to serialize cleanup or upgrade_sstables with other "maintenance" compaction tasks.
This was achieved in a better way in 6737c88, and after this change rewrite_sstables can safely take the read lock rather than the write lock and let regular compaction run in parallel without interruption.
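As a rough analogy for the locking change (plain std::shared_mutex here; Scylla actually uses Seastar's rwlock inside the compaction machinery, and the names below are invented for the sketch):

```cpp
// Illustrative analogy of the per-table compaction lock. Regular compaction
// holds the lock in shared (read) mode, so a maintenance rewrite that also
// takes it in shared mode can proceed concurrently; taking it in exclusive
// (write) mode per sstable instead waits behind every ongoing regular compaction.
#include <shared_mutex>
#include <vector>

struct sstable {};  // stand-in for sstables::sstable

struct table_state {
    std::shared_mutex compaction_lock;  // analogy for the per-table lock
};

void regular_compaction(table_state& t) {
    std::shared_lock lock(t.compaction_lock);  // read lock
    // ... merge the chosen bucket of sstables ...
}

void rewrite_sstables(table_state& t, const std::vector<sstable>& ssts) {
    for (const auto& sst : ssts) {
        // Before the fix: a std::unique_lock (write lock) was taken here, once
        // per sstable, so each iteration could stall behind a long regular compaction.
        std::shared_lock lock(t.compaction_lock);  // read lock: runs in parallel
        (void)sst;  // ... cleanup/upgrade/scrub-rewrite this one sstable ...
    }
}
```

The per-sstable exclusive acquisition is what made the whole cleanup wait behind regular compaction once per sstable.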
That said, I don't understand why rewrite_sstables acquires the `_maintenance_ops_sem` per sstable and not once for the whole task.
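To make that question concrete, here is the difference between the two acquisition patterns, sketched with a plain C++20 counting semaphore (hypothetical names; the real code uses seastar::semaphore and get_units):

```cpp
// Sketch of the two acquisition patterns for a maintenance-op semaphore.
// Names are hypothetical; this only illustrates the question above.
#include <semaphore>
#include <vector>

struct sstable {};
std::counting_semaphore<1> maintenance_ops_sem{1};

void rewrite_one(const sstable&) { /* rewrite a single sstable */ }

// Behavior as described above: take and release the semaphore once per
// sstable, so other maintenance ops can interleave in the middle of a task.
void rewrite_per_sstable(const std::vector<sstable>& ssts) {
    for (const auto& sst : ssts) {
        maintenance_ops_sem.acquire();
        rewrite_one(sst);
        maintenance_ops_sem.release();
    }
}

// The alternative being asked about: hold the semaphore for the whole task,
// so the cleanup/upgrade runs start-to-finish without interleaving.
void rewrite_whole_task(const std::vector<sstable>& ssts) {
    maintenance_ops_sem.acquire();
    for (const auto& sst : ssts) {
        rewrite_one(sst);
    }
    maintenance_ops_sem.release();
}
```

Holding it for the whole task would keep one maintenance op from interleaving with others, at the cost of making the others wait longer.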