
Off-Strategy Compaction #5226

Closed
slivne opened this issue Oct 27, 2019 · 18 comments

Comments

@slivne
Contributor

slivne commented Oct 27, 2019

From: @avikivity

A compaction strategy attempts to serve user workloads with certain characteristics by maintaining certain invariants that guarantee some performance level or other metric. For example, Time Window Compaction Strategy is tuned for time series workloads, maintains the invariant that no sstables cross time windows, and guarantees that the write amplification is O(log(window size / memory size)) rather than size-tiered’s O(log(data size / memory size)). The general-purpose size-tiered compaction strategy maintains the invariant that the number of sstables of similar size is low, and so the compaction-strategy components of read amplification and write amplification are both O(log(disk size / memory size)).
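
As a rough worked example (an illustrative sketch, not Scylla code; the data-set, window, and memtable sizes below are assumptions chosen only to make the formulas concrete), a few lines of C++ show what the two estimates come out to:

#include <cmath>
#include <cstdio>

int main() {
    double data_size   = 1e12; // 1 TB total data set (assumed)
    double window_size = 1e10; // ~10 GB written per TWCS time window (assumed)
    double memory_size = 1e9;  // ~1 GB memtable flushed to an sstable (assumed)

    // Size-tiered: every byte is rewritten roughly once per doubling of the data set.
    double stcs_wa = std::log2(data_size / memory_size);
    // Time-window: rewrites are bounded by the window size, not the data-set size.
    double twcs_wa = std::log2(window_size / memory_size);

    std::printf("STCS-like write amplification ~ %.1f\n", stcs_wa); // ~10
    std::printf("TWCS write amplification      ~ %.1f\n", twcs_wa); // ~3.3
    return 0;
}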

We define Off-Strategy Compactions as a compaction happening at a state where the invariant is broken. The invariant can be broken for the following reasons:

  • Resharding
  • Bootstrap
  • Repair
  • Importing sstables (“nodetool refresh”)
  • Upgrade SSTables (“nodetool upgradesstables”)
  • Altering the compaction strategy
  • User load that does not fit the compaction strategy (for example, non-time-series data with time windowed compaction strategy)

While the last case is a user error, the other cases happen during normal or semi-normal operation and will result in degraded performance, without any ability on the user’s part to mitigate the effects.

Types of off-strategy compactions

  • Resharding - Resharding happens when a node is restarted with a different number of active processors (shards), or if some other partitioner parameter has changed. Since partitioning to shards is static, data must be moved to the new shards; usually requiring a full read and write of the entire data set. While resharding is happening, all reads have to consider all sstables, not just the shard’s sstables, resulting in O(number of shards) additional read amplification. While the bloom filter will reduce this for single-partition reads, range scans will have to read all of those sstables, resulting in lower throughput and lower concurrency (due to increased memory consumption).
  • Bootstrap/repair - The bootstrap and repair processes are vnode-oriented, so after they complete, the node will have 256 additional sstables. Those new sstables participate in reads, causing read amplification and increased memory consumption.
  • Changing compaction strategy - When the user changes compaction strategy, the existing sstables do not conform to the new compaction strategy’s invariants, and therefore the new compaction strategy may perform poorly.
  • Importing sstables - In the general case importing sstables is similar to resharding and changing the compaction strategy. The imported sstables are not properly sharded, and may have been generated using a different compaction strategy (or no compaction strategy at all).

Proposed Solution:

We will enhance the usage of the staging directory (currently used by the Materialized Views code) so that all of these cases place their sstables in the staging directory.
The benefit of using the staging directory is that we can serve reads from these sstables (using a separate sstable_set) and also major-compact them before moving them to be compacted with the other sstables.
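
A minimal conceptual sketch of the idea, under assumed names (this is not Scylla's actual API): the table keeps its main sstable set plus a staging set, reads consult both, and off-strategy compaction reshapes the staging set before handing it over to the main set.

#include <functional>
#include <vector>

struct sstable { long generation; long size_bytes; };

struct table_sstables {
    std::vector<sstable> main_set;     // maintained by the compaction strategy; invariant holds
    std::vector<sstable> staging_set;  // resharded/streamed/repaired/imported; invariant not yet restored

    // Reads must consider both sets until the staging sstables are integrated.
    std::vector<sstable> sstables_for_read() const {
        std::vector<sstable> all = main_set;
        all.insert(all.end(), staging_set.begin(), staging_set.end());
        return all;
    }

    // Off-strategy compaction: reshape the staging set in isolation (e.g. a major
    // compaction, running in the maintenance scheduling group), then move the
    // result into the main set for the normal strategy to own.
    void integrate_staging(const std::function<std::vector<sstable>(std::vector<sstable>)>& reshape) {
        std::vector<sstable> reshaped = reshape(std::move(staging_set));
        staging_set.clear();
        main_set.insert(main_set.end(), reshaped.begin(), reshaped.end());
    }
};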

@slivne
Contributor Author

slivne commented Nov 14, 2019

@bhalevy queue as part of the compaction work ...

@asias
Contributor

asias commented Feb 10, 2020

Please reference this issue for all the compaction related read/write workload performance drop during node operations, e.g., bootstrap, decommission, repair.

asias added a commit to asias/scylla that referenced this issue Feb 21, 2020
table::add_sstable_and_update_cache is used by streaming and repair to
add sstables generated during the operation.

When we run a node maintenance operation, e.g., bootstrap or repair, a set
of new sstables is created. For a given token range for a given
table, typically a vnode token range, streaming creates 1 new
sstable per shard, while repair creates 1 new sstable per shard on
repair followers and RF - 1 new sstables per shard on the repair master.

Currently, we add those new sstables to the main sstable set directly
and let normal compaction compact and "reshape" those sstables.
As a result, we effectively push the maintenance work to the compaction
scheduling group instead of the maintenance scheduling group, which
impacts user read performance. In addition, sstables that lack
the invariant properties defined by the compaction strategy may cause
bad read performance.

Until Off-Strategy Compaction is available, which will take a while, we
can mitigate the problem by not triggering a compaction
every time we add a stream/repair generated sstable.
Refs: scylladb#5226 scylladb#5199 scylladb#5109
asias added a commit to asias/scylla that referenced this issue Feb 21, 2020
table::add_sstable_and_update_cache is used by streaming and repair to
add sstables generated during the operation.

When we run a node maintenance operation, e.g., bootstrap or repair, a set
of new sstables is created. For a given token range for a given
table, typically a vnode token range, streaming creates 1 new
sstable per shard, while repair creates 1 new sstable per shard on
repair followers and RF - 1 new sstables per shard on the repair master.

Currently, we add those new sstables to the main sstable set directly
and let normal compaction compact and "reshape" those sstables.
As a result, we effectively push the maintenance work to the compaction
scheduling group instead of the maintenance scheduling group, which
impacts user read performance. In addition, sstables that lack
the invariant properties defined by the compaction strategy may cause
bad read performance.

Until Off-Strategy Compaction is available, which will take a while, we
can mitigate the problem by not triggering a compaction
every time we add a stream/repair generated sstable.


Refs: scylladb#5495 scylladb#5226 scylladb#5199 scylladb#5109 scylladb#4884
@raphaelsc
Member

raphaelsc commented Feb 21, 2020

Not sure if major compaction is the panacea for integration, although it improves
the current state.

Consider this case for STCS. Let's say we have 1TB data in active sstable list.

The active sstable list is composed of log2(1TB / ~1G) size tiers (with a min
threshold of 2), where the tiers are laid out as follows:

[tier 0]: 1 sst of 8G
[tier 1]: 1 sst of 32G
[tier 2]: 1 sst of 64G
[tier 3]: 1 sst of 128G
[tier 4]: 1 sst of 256G
[tier 5]: 1 sst of 512G

Now let's say we load 2TB of data spread across many sstables.

If we major compacted them, we'd end up with 1 sstable of 2TB, which if integrated
into the active sstable list, could be left uncompacted for a long time, because
depending on data set size, we couldn't even produce another sstable of a similar size
for the former 2TB sstable to be compacted with.

What if we run STCS compactions on this separate sstable list until the compaction
strategy is satisfied?

So the size tiers for this separate sstable list of 2TB would look as follows when
the strategy is satisfied:

[tier 0]: 1 sst of 16G
[tier 1]: 1 sst of 64G
[tier 2]: 1 sst of 128G
[tier 3]: 1 sst of 256G
[tier 4]: 1 sst of 512G
[tier 5]: 1 sst of 1024G

When we merge the separate sstable list into the active (main) sstable list, it will
look as follows:

[tier 0]: 1 sst of 8G
[tier 1]: 1 sst of 16G
[tier 2]: 1 sst of 32G
[tier 3]: 2 sst of 64G
[tier 4]: 2 sst of 128G
[tier 5]: 2 sst of 256G
[tier 6]: 2 sst of 512G
[tier 7]: 1 sst of 1024G

Then the compaction strategy will kick in again, working towards reducing space and
read amplification. If we had integrated a single 2TB sstable instead, we could
end up with high read and space amplification for an indefinite period of time, until
we run a major compaction, for example.

However, it seems to me that, as an optimization, we could run major compaction
on the separate sstable list if its total data size is smaller than the active
sstable list's, in order to decrease write amplification.

We'll need to properly implement TWCS major compaction for this to succeed, given
that right now it breaks the time-window invariant by merging all sstables into one.
LCS major compaction already merges all input data into N non-overlapping,
fixed-size sstables at the last level.
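
To make the merge above easy to reproduce, here is a toy simulation of the tier layouts (illustrative only; real STCS bucketing also applies bucket_low/bucket_high and min/max thresholds, which are ignored here). The tier index is simply floor(log2(size in GB)), so sstables of similar size land in the same tier:

#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

static int tier_of(double size_gb) { return (int)std::floor(std::log2(size_gb)); }

static void print_tiers(const char* name, const std::vector<double>& sizes_gb) {
    std::map<int, int> tiers; // tier index -> number of sstables in that tier
    for (double s : sizes_gb) tiers[tier_of(s)]++;
    std::printf("%s:\n", name);
    for (const auto& [tier, count] : tiers)
        std::printf("  [~%dG tier]: %d sstable(s)\n", 1 << tier, count);
}

int main() {
    std::vector<double> active = {8, 32, 64, 128, 256, 512};    // ~1TB main set
    std::vector<double> staged = {16, 64, 128, 256, 512, 1024}; // ~2TB staging set after reshaping
    std::vector<double> merged = active;
    merged.insert(merged.end(), staged.begin(), staged.end());

    print_tiers("active set", active);
    print_tiers("reshaped staging set", staged);
    print_tiers("merged set", merged); // tiers that now hold 2 sstables become compaction candidates
    return 0;
}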

@glommer
Contributor

glommer commented Feb 21, 2020

We need to do cross-tier compactions anyway; otherwise tombstones don't go away.

And you're right that TWCS will have to have a specialized version of major compaction.

@raphaelsc
Member

raphaelsc commented Feb 21, 2020

We need to think this cross-tier compaction idea through, so as not to increase STCS write amplification when it matters. For example, there's probably no purpose in doing it when the workload is write-only. Also, we shouldn't do it when we're under a heavy write workload, unless we're facing a disk pressure scenario and we determine that doing cross-tier compaction will save space by getting rid of obsolete data, or perhaps we want to reduce read amplification. I wouldn't like us to face the same problem as when we lowered the min threshold to 2 and increased STCS write amplification by a factor of ~2, affecting write performance. So the biggest challenge is determining an appropriate time to trigger cross-tier compaction. The disk space controller will be handy here!

@avikivity
Member

We need to do cross-tier compactions anyway; otherwise tombstones don't go away.

And you're right that TWCS will have to have a specialized version of major compaction.

That's easy: override get_major_compaction_job() (and use segregate_by_timestamp() from @denesb); it's only some gluing away.
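
A rough sketch of that shape, with hypothetical types (the real change would go through get_major_compaction_job() / segregate_by_timestamp() inside Scylla): a TWCS-aware major compaction groups its input by time window and emits one output per window, rather than merging everything into a single sstable.

#include <cstdint>
#include <map>
#include <vector>

struct fragment { int64_t write_timestamp_us; /* key, value, ... */ };

// One output bucket per time window, so the "no sstable crosses a window"
// invariant survives a major compaction as well.
std::map<int64_t, std::vector<fragment>>
segregate_by_window(const std::vector<fragment>& input, int64_t window_size_us) {
    std::map<int64_t, std::vector<fragment>> per_window;
    for (const auto& f : input) {
        int64_t window = f.write_timestamp_us / window_size_us; // window this data belongs to
        per_window[window].push_back(f);
    }
    return per_window; // each bucket is written out as one sstable
}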

@avikivity
Member

Not sure if major compaction is the panacea for integration, although it improves
the current state.

Consider this case for STCS. Let's say we have 1TB data in active sstable list.

The active sstable list is composed of log2(1TB / ~1G) size tiers (with a min
threshold of 2), where the tiers are laid out as follows:

[tier 0]: 1 sst of 8G
[tier 1]: 1 sst of 32G
[tier 2]: 1 sst of 64G
[tier 3]: 1 sst of 128G
[tier 4]: 1 sst of 256G
[tier 5]: 1 sst of 512G

Now let's say we load 2TB of data spread across many sstables.

If we major compacted them, we'd end up with 1 sstable of 2TB, which if integrated
into the active sstable list, could be left uncompacted for a long time, because
depending on data set size, we couldn't even produce another sstable of a similar size
for the former 2TB sstable to be compacted with.

What if we run STCS compactions on this separate sstable list until the compaction
strategy is satisfied?

So the size tiers for this separate sstable list of 2TB would look as follows when
the strategy is satisfied:

[tier 0]: 1 sst of 16G
[tier 1]: 1 sst of 64G
[tier 2]: 1 sst of 128G
[tier 3]: 1 sst of 256G
[tier 4]: 1 sst of 512G
[tier 5]: 1 sst of 1024G

When we merge the separate sstable list into the active (main) sstable list, it will
look as follows:

[tier 0]: 1 sst of 8G
[tier 1]: 1 sst of 16G
[tier 2]: 1 sst of 32G
[tier 3]: 2 sst of 64G
[tier 4]: 2 sst of 128G
[tier 5]: 2 sst of 256G
[tier 6]: 2 sst of 512G
[tier 7]: 1 sst of 1024G

Then the compaction strategy will kick in again, working towards reducing space and
read amplification. If we had integrated a single 2TB sstable instead, we could
end up with high read and space amplification for an indefinite period of time, until
we run a major compaction, for example.

I don't think it's a big problem. You can arrive at a similar situation by normally writing until you get a 2TB sstable, and then doing overwrites and deletions for the lower tiers until their sstables shrink. So yes, you have a large sstable, but it's not harming anyone.

We need to compact it from time to time to get rid of tombstones, but that's independent of import/streaming scenarios. Perhaps tag any sstable that hasn't been touched for some time for low-priority compaction.

asias added a commit to asias/scylla that referenced this issue Feb 25, 2020
table::add_sstable_and_update_cache is used by streaming and repair to
add sstables generated during the operation.

When we run a node maintenance operation, e.g., bootstrap or repair, a set
of new sstables is created. For a given token range for a given
table, typically a vnode token range, streaming creates 1 new
sstable per shard, while repair creates 1 new sstable per shard on
repair followers and RF - 1 new sstables per shard on the repair master.

Currently, we add those new sstables to the main sstable set directly
and let normal compaction compact and "reshape" those sstables.
As a result, we effectively push the maintenance work to the compaction
scheduling group instead of the maintenance scheduling group, which
impacts user read performance. In addition, sstables that lack
the invariant properties defined by the compaction strategy may cause
bad read performance.

Until Off-Strategy Compaction is available, which will take a while, we
can mitigate the problem by not triggering a compaction
every time we add a stream generated sstable. Instead, we trigger a
compaction when all the streaming is done.

Refs: scylladb#5495 scylladb#5226 scylladb#5199 scylladb#5109 scylladb#4884
@raphaelsc
Member

BTW, I think off-strategy compaction will be easier with STCS in terms of compaction work (write amplification) after we integrate the major-compacted sstable(s) into the table's main sstable set. That's because of how the strategies work.

STCS major compaction generates a single sstable that will be placed at the biggest size tier.

TWCS major compaction will generate one sstable for each time window, which means that compaction will subsequently have to compact two sstables for each time window involved. I don't know how this could be done better, but it may result in up to all existing time windows being recompacted.

LCS major compaction generates a new level L, composed of N sstables (a run); with nodetool compact, L is usually the table's highest level. IIUC we're left with 2 options: 1) have off-strategy compaction generate a new level L equal to the table's highest level, meaning the two will have to be merged when integrated because of the overlapping invariant. That's not good, because it basically means rewriting the entire table, given that the highest level contains about 90% of the data set. 2) place the level (run) generated by major compaction at a level which is empty and higher than the table's highest level. This looks like the most viable approach; if the new level is considered unnecessary later on, LCS will eventually start demoting sstables from this new level in a non-aggressive way.
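
A small sketch of option 2 under assumed, simplified structures (not Scylla code): place the run produced by off-strategy major compaction at an empty level above the table's current highest level, so integration never forces a rewrite of the existing highest level.

#include <utility>
#include <vector>

struct run { /* a sorted sequence of non-overlapping sstables */ };

struct lcs_levels {
    std::vector<std::vector<run>> levels; // levels[0] = L0, levels[1] = L1, ...

    int highest_non_empty_level() const {
        for (int l = (int)levels.size() - 1; l >= 0; --l)
            if (!levels[l].empty()) return l;
        return 0;
    }

    // Option 2: integrate the off-strategy run at (highest level + 1), which is
    // guaranteed empty, instead of merging it into the existing highest level.
    void integrate_off_strategy_run(run r) {
        int target = highest_non_empty_level() + 1;
        if ((int)levels.size() <= target) levels.resize(target + 1);
        levels[target].push_back(std::move(r));
    }
};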

asias added a commit to asias/scylla that referenced this issue Mar 18, 2020
Compaction dominating IO bandwidth makes streaming slow during
bootstrap. To mitigate the problem before we have off-strategy compaction
support, we want to reduce the compaction work during bootstrap. The
patch 'compaction: use a larger min_threshold during bootstrap, replace'
reduces compaction work by increasing the min threshold of sstables for
compaction.

In this patch, we reduce the number of sstables generated for bootstrap
by increasing the ranges per stream plan. The bootstrapping node does not
serve reads, so it does not have the cache invalidation problem that
decommission has.

Refs: scylladb#5226 scylladb#5199 scylladb#5109 scylladb#4884
@raphaelsc
Member

Resharding analysis for off-strategy compaction
@avikivity @glommer

Right now, resharding may produce N * smp::count sstables, where N is the number of input shared sstables. That's because resharding may decide to put each shared sstable into a separate resharding job, resulting in each input sstable being resharded individually.

To produce only 1 sstable for each shard, as envisioned by off-strategy compaction, resharding as-is would have to run on a single shard with all input sstables. But we need to take advantage of parallelism.

If we split all resharding work among all shards, with each shard running only 1 resharding job (as a sort of major compaction), smp::count sstables would be produced for each shard. Is that acceptable? If not, after resharding completes, we could run a "major compaction" on the smp::count sstables produced by resharding for a given shard.
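
For a quick sense of the numbers (assumed values, purely illustrative):

#include <cstdio>

int main() {
    int shards = 8;           // smp::count (assumed)
    int shared_sstables = 64; // N input sstables that span shards (assumed)

    // Today: each shared sstable may get its own resharding job, producing
    // up to one output sstable per shard per input sstable.
    int today_total = shared_sstables * shards; // 512

    // Proposed: one resharding job per shard over all inputs produces
    // smp::count sstables per shard, optionally major-compacted down to 1.
    int proposed_per_shard = shards; // 8 per shard, 64 total
    int proposed_after_major = 1;    // 1 per shard, 8 total

    std::printf("current approach: up to %d sstables\n", today_total);
    std::printf("proposed: %d per shard (%d total), or %d per shard (%d total) after a per-shard major compaction\n",
                proposed_per_shard, proposed_per_shard * shards,
                proposed_after_major, proposed_after_major * shards);
    return 0;
}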

avikivity added a commit that referenced this issue Apr 28, 2021
…aphael

"
From now on, offstrategy compaction is triggered on completion of repair-based
removenode. So compaction will no longer act aggressively while removenode
is going on, which helps reduce both latency and operation time.

Refs #5226.
"

* 'offstrategy_removenode' of github.com:raphaelsc/scylla:
  repair: Wire offstrategy compaction to repair-based removenode
  table: introduce trigger_offstrategy_compaction()
  repair/row_level: make operations_supported static const
asias added a commit to asias/scylla that referenced this issue May 7, 2021
When decommission is done, all nodes that receive data from the
decommissioning node will run the node_ops_cmd::decommission_done handler.

Trigger off-strategy compaction inside the handler to wire off-strategy
for decommission.

Refs scylladb#5226
avikivity pushed a commit that referenced this issue May 10, 2021
When decommission is done, all nodes that receive data from the
decommissioning node will run the node_ops_cmd::decommission_done handler.

Trigger off-strategy compaction inside the handler to wire off-strategy
for decommission.

Refs #5226

Closes #8607
avikivity pushed a commit that referenced this issue May 11, 2021
When decommission is done, all nodes that receive data from the
decommissioning node will run the node_ops_cmd::decommission_done handler.

Trigger off-strategy compaction inside the handler to wire off-strategy
for decommission.

Refs #5226

Closes #8607
@bhalevy
Member

bhalevy commented May 27, 2021

After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing compaction strategy.
Is that right @raphaelsc ?

@raphaelsc
Member

After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing compaction strategy.
Is that right @raphaelsc ?

you're absolutely correct, sir.

@bhalevy
Member

bhalevy commented Aug 3, 2022

@slivne with the remaining work here I think we can demote to Eng-3 label, or maybe remove the level entirely.

@slivne slivne added Eng-3 and removed Eng-2 labels Aug 3, 2022
@bhalevy bhalevy unassigned cmm Sep 7, 2022
@DoronArazii DoronArazii modified the milestones: 5.1, 5.2 Oct 12, 2022
@DoronArazii DoronArazii modified the milestones: 5.2, 5.x Nov 22, 2022
@mykaul
Contributor

mykaul commented Feb 14, 2023

After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing compaction strategy.
Is that right @raphaelsc ?

you're absolutely correct, sir.

@bhalevy , @raphaelsc - is the above tracked somewhere?

@bhalevy
Member

bhalevy commented Feb 14, 2023

After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing compaction strategy.
Is that right @raphaelsc ?

you're absolutely correct, sir.

@bhalevy , @raphaelsc - is the above tracked somewhere?

No, AFAIK. We can open a separate issue for it if needed.
