Cleanup compaction / rewrite sstable block behind regular compaction for each sstable #10175

Closed
bhalevy opened this issue Mar 6, 2022 · 13 comments

bhalevy commented Mar 6, 2022

Opening this issue based on my analysis in #10060 (comment)

With 40a6049, rewrite_sstables takes the write lock for each sstable.
Since it will block behind regular compaction, the whole cleanup task might take a very long time.

The purpose of this change was to serialize cleanup or upgrade_sstables with other "maintenance" compaction tasks.

This was achieved in a better way in 6737c88, and after this change rewrite_sstables can safely take the read lock rather than the write lock, letting regular compaction run in parallel without interruption.

That said, I don't understand why rewrite_sstables acquires the _maintenance_ops_sem per sstable and not once for the whole task.
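
For illustration only, here is a minimal, self-contained C++ sketch of the locking difference discussed above. It is not Scylla's compaction_manager code: the per-table lock is modeled with std::shared_mutex, and table_state, regular_compaction, rewrite_sstables_write_lock, and rewrite_sstables_read_lock are hypothetical names invented for the example.

```cpp
// Illustrative sketch only -- not Scylla code. Models the per-table lock
// with std::shared_mutex: regular compaction holds it in shared mode, so a
// per-sstable rewrite that takes it in exclusive (write) mode must wait for
// every in-flight compaction, while taking it in shared (read) mode lets
// both proceed concurrently.
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <vector>

struct table_state {
    std::shared_mutex lock;   // stand-in for the per-table rwlock
};

void regular_compaction(table_state& t) {
    std::shared_lock guard(t.lock);   // regular compaction: read lock
    // ... merge the sstables of one bucket ...
}

// Old behavior: exclusive (write) lock per sstable -- each iteration waits
// for any running regular compaction before rewriting a single sstable.
void rewrite_sstables_write_lock(table_state& t, const std::vector<std::string>& ssts) {
    for (const auto& sst : ssts) {
        std::unique_lock guard(t.lock);
        std::cout << "rewriting " << sst << "\n";
    }
}

// Proposed behavior: shared (read) lock per sstable -- the cleanup/upgrade
// rewrite and regular compaction run in parallel without interruption.
void rewrite_sstables_read_lock(table_state& t, const std::vector<std::string>& ssts) {
    for (const auto& sst : ssts) {
        std::shared_lock guard(t.lock);
        std::cout << "rewriting " << sst << "\n";
    }
}

int main() {
    table_state t;
    rewrite_sstables_read_lock(t, {"sst-1", "sst-2"});
}
```

In this model, every iteration of rewrite_sstables_write_lock can block behind a long regular compaction on the same table, which is the per-sstable stall described above.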

bhalevy commented Mar 6, 2022

This is hard to reproduce in dtest, since we need large sstables and probably a heavy write load to trigger long compactions in parallel with cleanup (or upgrade_sstables, for that matter).

I was able to reproduce the cleanup slowdown in simulation by injecting a 1-second sleep into each compaction job; cleaning up 50 sstables took ~37 seconds.
After removing the write lock, cleanup time went down to a little less than a second (all in dev mode).

bhalevy commented Mar 6, 2022

Also, since _maintenance_ops_sem is global to the compaction_manager, we're compacting one sstable at a time per shard across the whole database.

I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader) and makes it per-keyspace, so we can run one (sharded) major or offstrategy compaction per keyspace at a time, vs. one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace, since the latter, which use rewrite_sstables, compact only a single sstable at a time.
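
To make the idea concrete, here is an illustrative C++ sketch of serializing maintenance operations per keyspace rather than per shard. It is not the actual patch: keyspace_maintenance_serializer and run_serialized are hypothetical names, and a plain std::mutex stands in for the single-unit per-keyspace semaphore.

```cpp
// Illustrative sketch only -- not the actual patch. Each keyspace gets its
// own single-unit "semaphore" (a mutex here), so a maintenance operation on
// keyspace A no longer blocks one on keyspace B, while two maintenance
// operations on the same keyspace still run one at a time.
#include <functional>
#include <map>
#include <mutex>
#include <string>

class keyspace_maintenance_serializer {
    struct gate {
        std::mutex m;   // one unit of concurrency per keyspace
    };
    std::mutex _map_lock;
    std::map<std::string, gate> _gates;

    gate& gate_for(const std::string& keyspace) {
        std::lock_guard lk(_map_lock);
        return _gates[keyspace];   // created on first use; map nodes are stable
    }

public:
    // Runs op while holding the per-keyspace maintenance lock.
    void run_serialized(const std::string& keyspace, const std::function<void()>& op) {
        std::lock_guard lk(gate_for(keyspace).m);
        op();
    }
};
```

With something like this at the api layer, a scrub of one keyspace would not block a major compaction of another, while two maintenance operations on the same keyspace would still be serialized.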

bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 6, 2022
…eyspace

Maintenance operations need not be serialized globally.
There is no reason not to run e.g. cleanup or reshape on
multiple keyspaces and even multiple tables within
each keyspace in parallel, so let the api layer
serialize maintenance operations per keyspace
and schedule the compaction_manager tasks in the keyspace.

Refs scylladb#10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 6, 2022
…jor and regular compaction

Maintenance operations like cleanup, upgrade, reshape, and reshard
are serialized with major compaction using `serialize_maintenance_operation`
and the per-keyspace _maintenance_ops semaphore.

Fixes scylladb#10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 6, 2022
…roup

With scylladb#10175 fixed, rewrite_sstables can run compaction
in maintenance scheduling group again, reinstating
4c05e5f and
reverting a9427f1
that reverted it.

Refs scylladb#10060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 6, 2022
This change effectively reverts 40a6049
that is no longer required after 6737c88
where the _maintenance_ops_sem is used to serialize
tasks using rewrite_sstables like cleanup, upgrade, reshape,
or scrub, with major compaction.

Since regular compaction may run in parallel no lock
is required per-table.

This solves the slowness in cleanup compaction (and
likely in upgrade sstables seen in scylladb#10060, and
possibly scylladb#10166).

Fixes scylladb#10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy commented Mar 7, 2022

Also refs #10097, which is focused on adding the output sstables in batches, to reduce write amplification and increase the compaction efficiency of the regular compaction that runs in parallel with cleanup/upgrade_sstables.

raphaelsc commented Mar 7, 2022

> Also, since _maintenance_ops_sem is global to the compaction_manager, we're compacting one sstable at a time per shard across the whole database.
>
> I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader), and makes it per-keyspace, so we can run one (sharded) major or offstrategy compaction per keyspace at a time, vs. one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace since the latter, that use rewrite_sstables, are compacting only a single sstable at a time.

I think this could be a regression. Maintenance operations cannot be made faster by running more of them. One cleanup per shard is enough to saturate its available bandwidth, as compaction takes advantage of read ahead and write behind for disk utilization.

@raphaelsc

> Opening this issue based on my analysis in #10060 (comment)
>
> With 40a6049, rewrite_sstables takes the write lock for each sstable. Since it will block behind regular compaction, the whole cleanup task might take a very long time.
>
> The purpose of this change was to serialize cleanup or upgrade_sstables with other "maintenance" compaction tasks.
>
> This was achieved in a better way in 6737c88, and after this change rewrite_sstables can safely take the read lock rather the write lock, and let regular compaction run in parallel without interruption.

Thanks for working on this Benny! It definitely makes sense to make rewrite_sstables() acquire the read lock instead.

> That said, I don't understand why rewrite_sstables acquires the _maintenance_ops_sem per sstable and not once for the whole task.

Yes, that's definitely asking for more trouble as we could have cleanup of different tables running intermixed. It's better to allow one at a time.

bhalevy commented Mar 8, 2022

>> Also, since _maintenance_ops_sem is global to the compaction_manager, we're compacting one sstable at a time per shard across the whole database.
>>
>> I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader), and makes it per-keyspace, so we can run one (sharded) major or offstrategy compaction per keyspace at a time, vs. one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace since the latter, that use rewrite_sstables, are compacting only a single sstable at a time.
>
> I think this could be a regression. Maintenance operations cannot be made faster by running more of them. One cleanup per shard is enough to saturate its available bandwidth, as compaction takes advantage of read ahead and write behind for disk utilization.

Why do we run parallel compaction jobs on multiple buckets at a time then?
I don't think the upper level logic should second guess resource scheduling, but rather it should do a very rough throttling and let the cpu / mem / disk schedulers do their job based on the perceived load.

If e.g. scrub is running on one keyspace or table, why not allow major compaction to run on another keyspace?
We allow regular compaction running there in parallel, don't we?

Or if we run resharding for nodetool refresh, that works on a specific table's uploads dir, why does it need to take a shard-global lock and prevent e.g. major compaction on an unrelated keyspace or table?

bhalevy commented Mar 8, 2022

> we could have cleanup of different tables running intermixed. It's better to allow one at a time.

I'm not sure this is a problem.
We allow regular compaction on all tables in parallel and that's just fine.
If we need to throttle the number of compaction jobs running in parallel to cap the overall resources compaction consumes, we should have a concurrency semaphore similar to the one we use for queries (a rough sketch follows below).
I don't think we should use a shard-global high-level semaphore for that.
A high-level semaphore like that belongs at the api layer, to serialize high-level maintenance operations where needed.
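
As a rough illustration of such a concurrency semaphore (not Scylla's actual query semaphore; compaction_concurrency_limiter and its cap are made up for the example), a per-shard cap on concurrent compaction jobs could look like this:

```cpp
// Illustrative sketch only. Caps the number of concurrently running
// compaction jobs: each job acquires one unit before running and releases
// it when done, bounding total concurrency without serializing unrelated
// tables or keyspaces.
#include <functional>
#include <semaphore>

class compaction_concurrency_limiter {
    std::counting_semaphore<> _units;   // available job slots
public:
    explicit compaction_concurrency_limiter(std::ptrdiff_t max_jobs)
        : _units(max_jobs) {}

    // Runs one compaction job under the concurrency cap.
    void run_job(const std::function<void()>& job) {
        _units.acquire();
        try {
            job();
        } catch (...) {
            _units.release();
            throw;
        }
        _units.release();
    }
};
```

Unlike a shard-global maintenance semaphore, this only throttles how many jobs run at once; which jobs to run, and in what order, is still left to the compaction manager and the cpu/disk schedulers.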

bhalevy added a commit to bhalevy/scylla that referenced this issue Mar 8, 2022
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in scylladb#10060, and
possibly scylladb#10166).

Fixes scylladb#10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@raphaelsc

>>> Also, since _maintenance_ops_sem is global to the compaction_manager, we're compacting one sstable at a time per shard across the whole database.
>>>
>>> I have a patch that moves the serialization of maintenance ops to the caller layer (api / distributed_loader), and makes it per-keyspace, so we can run one (sharded) major or offstrategy compaction per keyspace at a time, vs. one (sharded) cleanup / upgrade_sstables / scrub per table in the keyspace since the latter, that use rewrite_sstables, are compacting only a single sstable at a time.
>>
>> I think this could be a regression. Maintenance operations cannot be made faster by running more of them. One cleanup per shard is enough to saturate its available bandwidth, as compaction takes advantage of read ahead and write behind for disk utilization.
>
> Why do we run parallel compaction jobs on multiple buckets at a time then? I don't think the upper level logic should second guess resource scheduling, but rather it should do a very rough throttling and let the cpu / mem / disk schedulers do their job based on the perceived load.
>
> If e.g. scrub is running on one keyspace or table, why not allow major compaction to run on another keyspace? We allow regular compaction running there in parallel, don't we?
>
> Or if we run resharding for nodetool refresh, that works on a specific table's uploads dir, why does it need to take a shard-global lock and prevent e.g. major compaction on an unrelated keyspace or table?

We do allow parallel compaction jobs on multiple buckets to reduce read amplification, never to increase compaction efficiency. In the past, the compaction manager was very keen on running more jobs in parallel, but that was a terrible mistake, as efficiency is a function of informed decisions, not parallelism.

For example, if we serialized compaction on a table and we're compacting 64G of data at 1GB/s, that compaction takes 64 seconds; if we're flushing new files at the same rate during that window, by the time the compaction completes we'd have generated 64 files in level 0.

We can of course allow maintenance operations to go in parallel, but we have to keep in mind that we're not doing so to increase efficiency.

@raphaelsc

>> we could have cleanup of different tables running intermixed. It's better to allow one at a time.
>
> I'm not sure this is a problem. We allow regular compaction on all tables in parallel and that's just fine. If we need to throttle then number of compaction jobs running in parallel to cap the overall resources compaction is consuming, we should have a concurrency semaphore similar to the one we use for queries. I don't think we should use a shard-global high level semaphore for that. The latter should be used at the api layer to serialize high-level maintenance operations where needed.

I don't see any good reason to run cleanup for more than one table at a time.

But I can see very good reasons for running cleanup for a single table at a time, namely resiliency: we can start with the smallest table to increase the chances of success later on. The implementation of #10097 also becomes more efficient, as the cleanup for a table T will not be fighting for resources with the cleanup for another table, so the proposed batching can be done faster.

bhalevy commented Mar 13, 2022

Should be backported to 4.6 based on #10060 (comment)

denesb pushed a commit that referenced this issue Mar 14, 2022
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in #10060, and
possibly #10166).

Fixes #10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10177

(cherry picked from commit 11ea2ff)

denesb commented Mar 14, 2022

@bhalevy please prepare a PR with the backport to 4.6; it doesn't apply cleanly (it applies cleanly to 5.0).

bhalevy commented Mar 14, 2022

> @bhalevy please prepare a PR with the backport to 4.6; it doesn't apply cleanly (it applies cleanly to 5.0).

@denesb patch sent to scylladb-dev

avikivity pushed a commit that referenced this issue Mar 14, 2022
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in #10060, and
possibly #10166).

Fixes #10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10177

(cherry picked from commit 11ea2ff)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220314151416.2496374-1-bhalevy@scylladb.com>

denesb commented Mar 15, 2022

Backported to all vulnerable branches (4.6, 5.0), removing "Backport candidate" label.

DoronArazii modified the milestones: 5.0, 5.1 on Nov 8, 2022