Cleanup compaction / rewrite sstable block behind regular compaction for each sstable #10175

Closed
bhalevy opened this issue Mar 6, 2022 · 13 comments

bhalevy commented Mar 6, 2022

Opening this issue based on my analysis in #10060 (comment)

With 40a6049, rewrite_sstables takes the write lock for each sstable.
Since it will block behind regular compaction, the whole cleanup task might take a very long time.

The purpose of this change was to serialize cleanup or upgrade_sstables with other "maintenance" compaction tasks.

This was achieved in a better way in 6737c88, and after this change rewrite_sstables can safely take the read lock rather than the write lock, letting regular compaction run in parallel without interruption.

That said, I don't understand why rewrite_sstables acquires the _maintenance_ops_sem per sstable and not once for the whole task.
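
For illustration only, here is a minimal, self-contained C++ sketch of the locking difference discussed above. It is not Scylla's compaction_manager code: the per-table lock is modeled with std::shared_mutex, and table_state, regular_compaction, rewrite_sstables_write_lock, and rewrite_sstables_read_lock are hypothetical names invented for the example.

```cpp
// Illustrative sketch only -- not Scylla code. Models the per-table lock
// with std::shared_mutex: regular compaction holds it in shared mode, so a
// per-sstable rewrite that takes it in exclusive (write) mode must wait for
// every in-flight compaction, while taking it in shared (read) mode lets
// both proceed concurrently.
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <vector>

struct table_state {
    std::shared_mutex lock;   // stand-in for the per-table rwlock
};

void regular_compaction(table_state& t) {
    std::shared_lock guard(t.lock);   // regular compaction: read lock
    // ... merge the sstables of one bucket ...
}

// Old behavior: exclusive (write) lock per sstable -- each iteration waits
// for any running regular compaction before rewriting a single sstable.
void rewrite_sstables_write_lock(table_state& t, const std::vector<std::string>& ssts) {
    for (const auto& sst : ssts) {
        std::unique_lock guard(t.lock);
        std::cout << "rewriting " << sst << "\n";
    }
}

// Proposed behavior: shared (read) lock per sstable -- the cleanup/upgrade
// rewrite and regular compaction run in parallel without interruption.
void rewrite_sstables_read_lock(table_state& t, const std::vector<std::string>& ssts) {
    for (const auto& sst : ssts) {
        std::shared_lock guard(t.lock);
        std::cout << "rewriting " << sst << "\n";
    }
}

int main() {
    table_state t;
    rewrite_sstables_read_lock(t, {"sst-1", "sst-2"});
}
```

In this model, every iteration of rewrite_sstables_write_lock can block behind a long regular compaction on the same table, which is the per-sstable stall described above.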

bhalevy commented Mar 6, 2022

This is hard to reproduce in dtest, since we need large sstables and probably a heavy write load to trigger long compactions in parallel with cleanup (or upgrade_sstables, for that matter).

I was able to reproduce the cleanup slowdown in simulation by injecting a 1-second sleep into each compaction job; cleaning up 50 sstables took ~37 seconds.
After removing the write lock, cleanup time went down to a little less than a second (all in dev mode).

bhalevy commented Mar 6, 2022

Also, since _maintenance_ops_sem is global to the compaction_manager, we're compacting one sstable at a time per shard across the whole database.

I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader) and makes it per-keyspace, so we can run one (sharded) major or offstrategy compaction per keyspace at a time, vs. one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace, since the latter, which use rewrite_sstables, compact only a single sstable at a time.
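
To make the idea concrete, here is an illustrative C++ sketch of serializing maintenance operations per keyspace rather than per shard. It is not the actual patch: keyspace_maintenance_serializer and run_serialized are hypothetical names, and a plain std::mutex stands in for the single-unit per-keyspace semaphore.

```cpp
// Illustrative sketch only -- not the actual patch. Each keyspace gets its
// own single-unit "semaphore" (a mutex here), so a maintenance operation on
// keyspace A no longer blocks one on keyspace B, while two maintenance
// operations on the same keyspace still run one at a time.
#include <functional>
#include <map>
#include <mutex>
#include <string>

class keyspace_maintenance_serializer {
    struct gate {
        std::mutex m;   // one unit of concurrency per keyspace
    };
    std::mutex _map_lock;
    std::map<std::string, gate> _gates;

    gate& gate_for(const std::string& keyspace) {
        std::lock_guard lk(_map_lock);
        return _gates[keyspace];   // created on first use; map nodes are stable
    }

public:
    // Runs op while holding the per-keyspace maintenance lock.
    void run_serialized(const std::string& keyspace, const std::function<void()>& op) {
        std::lock_guard lk(gate_for(keyspace).m);
        op();
    }
};
```

With something like this at the api layer, a scrub of one keyspace would not block a major compaction of another, while two maintenance operations on the same keyspace would still be serialized.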

bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 6, 2022
…eyspace

Maintenance operations need not be serialized globally.
There is no reason not to run e.g. cleanup or reshape on
multiple keyspaces and even multiple tables within
each keyspace in parallel, so let the api layer
serialize maintenance operations per keyspace
and schedule the compaction_manager tasks in the keyspace.

Refs scylladb#10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 6, 2022
…jor and regular compaction

Maintenance operations like cleanup, upgrade, reshape, and reshard
are serialized with major compaction using `serialize_maintenance_operation`
and the per-keyspace _maintenance_ops semaphore.

Fixes scylladb#10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 6, 2022
…roup

With scylladb#10175 fixed, rewrite_sstables can run compaction
in maintenance scheduling group again, reinstating
4c05e5f and
reverting a9427f1
that reverted it.

Refs scylladb#10060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 6, 2022
This change effectively reverts 40a6049
that is no longer required after 6737c88
where the _maintenance_ops_sem is used to serialize
tasks using rewrite_sstables like cleanup, upgrade, reshape,
or scrub, with major compaction.

Since regular compaction may run in parallel no lock
is required per-table.

This solves the slowness in cleanup compaction (and
likely in upgrade sstables seen in scylladb#10060, and
possibly scylladb#10166).

Fixes scylladb#10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy commented Mar 7, 2022

Also refs #10097, which is focused on adding the output sstables in batches, to reduce write amplification and increase the compaction efficiency of the regular compaction that runs in parallel with cleanup/upgrade_sstables.

raphaelsc commented Mar 7, 2022

> Also, since _maintenance_ops_sem is global to the compaction_manager, we're compacting one sstable at a time per shard across the whole database.
>
> I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader), and makes it per-keyspace, so we can run one (sharded) major or offstrategy compaction per keyspace at a time, vs. one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace since the latter, that use rewrite_sstables, are compacting only a single sstable at a time.

I think this could be a regression. Maintenance operations cannot be made faster by running more of them. One cleanup per shard is enough to saturate its available bandwidth, as compaction takes advantage of read ahead and write behind for disk utilization.

@raphaelsc

> Opening this issue based on my analysis in #10060 (comment)
>
> With 40a6049, rewrite_sstables takes the write lock for each sstable. Since it will block behind regular compaction, the whole cleanup task might take a very long time.
>
> The purpose of this change was to serialize cleanup or upgrade_sstables with other "maintenance" compaction tasks.
>
> This was achieved in a better way in 6737c88, and after this change rewrite_sstables can safely take the read lock rather the write lock, and let regular compaction run in parallel without interruption.

Thanks for working on this Benny! It definitely makes sense to make rewrite_sstables() acquire the read lock instead.

> That said, I don't understand why rewrite_sstables acquires the _maintenance_ops_sem per sstable and not once for the whole task.

Yes, that's definitely asking for more trouble as we could have cleanup of different tables running intermixed. It's better to allow one at a time.

bhalevy commented Mar 8, 2022

>> Also, since _maintenance_ops_sem is global to the compaction_manager, we're compacting one sstable at a time per shard across the whole database.
>>
>> I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader), and makes it per-keyspace, so we can run one (sharded) major or offstrategy compaction per keyspace at a time, vs. one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace since the latter, that use rewrite_sstables, are compacting only a single sstable at a time.
>
> I think this could be a regression. Maintenance operations cannot be made faster by running more of them. One cleanup per shard is enough to saturate its available bandwidth, as compaction takes advantage of read ahead and write behind for disk utilization.

Why do we run parallel compaction jobs on multiple buckets at a time then?
I don't think the upper level logic should second guess resource scheduling, but rather it should do a very rough throttling and let the cpu / mem / disk schedulers do their job based on the perceived load.

If e.g. scrub is running on one keyspace or table, why not allow major compaction to run on another keyspace?
We allow regular compaction running there in parallel, don't we?

Or if we run resharding for nodetool refresh, that works on a specific table's uploads dir, why does it need to take a shard-global lock and prevent e.g. major compaction on an unrelated keyspace or table?

bhalevy commented Mar 8, 2022

> we could have cleanup of different tables running intermixed. It's better to allow one at a time.

I'm not sure this is a problem.
We allow regular compaction on all tables in parallel and that's just fine.
If we need to throttle the number of compaction jobs running in parallel to cap the overall resources compaction consumes, we should have a concurrency semaphore similar to the one we use for queries (a rough sketch follows below).
I don't think we should use a shard-global high-level semaphore for that.
A high-level semaphore like that belongs at the api layer, to serialize high-level maintenance operations where needed.
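
As a rough illustration of such a concurrency semaphore (not Scylla's actual query semaphore; compaction_concurrency_limiter and its cap are made up for the example), a per-shard cap on concurrent compaction jobs could look like this:

```cpp
// Illustrative sketch only. Caps the number of concurrently running
// compaction jobs: each job acquires one unit before running and releases
// it when done, bounding total concurrency without serializing unrelated
// tables or keyspaces.
#include <functional>
#include <semaphore>

class compaction_concurrency_limiter {
    std::counting_semaphore<> _units;   // available job slots
public:
    explicit compaction_concurrency_limiter(std::ptrdiff_t max_jobs)
        : _units(max_jobs) {}

    // Runs one compaction job under the concurrency cap.
    void run_job(const std::function<void()>& job) {
        _units.acquire();
        try {
            job();
        } catch (...) {
            _units.release();
            throw;
        }
        _units.release();
    }
};
```

Unlike a shard-global maintenance semaphore, this only throttles how many jobs run at once; which jobs to run, and in what order, is still left to the compaction manager and the cpu/disk schedulers.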

bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 8, 2022
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in scylladb#10060, and
possibly scylladb#10166).

Fixes scylladb#10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@raphaelsc

>>> Also, since _maintenance_ops_sem is global to the compaction_manager, we're compacting one sstable at a time per shard across the whole database.
>>>
>>> I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader), and makes it per-keyspace, so we can run one (sharded) major or offstrategy compaction per keyspace at a time, vs. one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace since the latter, that use rewrite_sstables, are compacting only a single sstable at a time.
>>
>> I think this could be a regression. Maintenance operations cannot be made faster by running more of them. One cleanup per shard is enough to saturate its available bandwidth, as compaction takes advantage of read ahead and write behind for disk utilization.
>
> Why do we run parallel compaction jobs on multiple buckets at a time then? I don't think the upper level logic should second guess resource scheduling, but rather it should do a very rough throttling and let the cpu / mem / disk schedulers do their job based on the perceived load.
>
> If e.g. scrub is running on one keyspace or table, why not allow major compaction to run on another keyspace? We allow regular compaction running there in parallel, don't we?
>
> Or if we run resharding for nodetool refresh, that works on a specific table's uploads dir, why does it need to take a shard-global lock and prevent e.g. major compaction on an unrelated keyspace or table?

We do allow parallel compaction jobs on multiple buckets to reduce read amplification, never to increase compaction efficiency. In the past, the compaction manager was very keen on running more jobs in parallel, but that was a terrible mistake, as efficiency is a function of informed decisions, not parallelism.

For example, if we serialized compaction on a table and we're compacting 64G of data at 1GB/s, that compaction takes 64 seconds; if we're flushing new files at the same rate during that window, by the time the compaction completes we'd have generated 64 files in level 0.

We can of course allow maintenance operations to go in parallel, but we have to keep in mind that we're not doing so to increase efficiency.

@raphaelsc

>> we could have cleanup of different tables running intermixed. It's better to allow one at a time.
>
> I'm not sure this is a problem. We allow regular compaction on all tables in parallel and that's just fine. If we need to throttle then number of compaction jobs running in parallel to cap the overall resources compaction is consuming, we should have a concurrency semaphore similar to the one we use for queries. I don't think we should use a shard-global high level semaphore for that. The latter should be used at the api layer to serialize high-level maintenance operations where needed.

I don't see any good reason to run cleanup for more than one table at a time.

But I can see very good reasons for running cleanup for a single table at a time, namely resiliency: we can start with the smallest table to increase the chances of success later on. The implementation of #10097 also becomes more efficient, as the cleanup for a table T will not be fighting for resources with the cleanup for another table, so the proposed batching can be done faster.

bhalevy commented Mar 13, 2022

Should be backported to 4.6 based on #10060 (comment)

denesb pushed a commit that referenced this issue Mar 14, 2022
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in #10060, and
possibly #10166).

Fixes #10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10177

(cherry picked from commit 11ea2ff)

denesb commented Mar 14, 2022

@bhalevy please prepare a PR with the backport to 4.6; it doesn't apply cleanly (it applies cleanly to 5.0).

bhalevy commented Mar 14, 2022

> @bhalevy please prepare a PR with the backport to 4.6; it doesn't apply cleanly (it applies cleanly to 5.0).

@denesb patch sent to scylladb-dev

avikivity pushed a commit that referenced this issue Mar 14, 2022
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in #10060, and
possibly #10166).

Fixes #10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10177

(cherry picked from commit 11ea2ff)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220314151416.2496374-1-bhalevy@scylladb.com>

denesb commented Mar 15, 2022

Backported to all vulnerable branches (4.6, 5.0), removing "Backport candidate" label.

DoronArazii modified the milestones: 5.0, 5.1 on Nov 8, 2022