
Off-Strategy Compaction #5226

Closed
slivne opened this issue Oct 27, 2019 · 18 comments

Comments

@slivne
Contributor

slivne commented Oct 27, 2019

From: @avikivity

A compaction strategy attempts to serve user workloads with certain characteristics by maintaining certain invariants that guarantee some performance level or other metric. For example, Time Window Compaction Strategy is tuned for time series workloads, maintains the invariant that no sstables cross time windows, and guarantees that the write amplification is O(log(window size / memory size)) rather than size-tiered’s O(log(data size / memory size)). The general-purpose size-tiered compaction strategy maintains the invariant that the number of sstables of similar size is low, and so the compaction-strategy components of read amplification and write amplification are both O(log(disk size / memory size)).
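
As a rough worked example (an illustrative sketch, not Scylla code; the data-set, window, and memtable sizes below are assumptions chosen only to make the formulas concrete), a few lines of C++ show what the two estimates come out to:

#include <cmath>
#include <cstdio>

int main() {
    double data_size   = 1e12; // 1 TB total data set (assumed)
    double window_size = 1e10; // ~10 GB written per TWCS time window (assumed)
    double memory_size = 1e9;  // ~1 GB memtable flushed to an sstable (assumed)

    // Size-tiered: every byte is rewritten roughly once per doubling of the data set.
    double stcs_wa = std::log2(data_size / memory_size);
    // Time-window: rewrites are bounded by the window size, not the data-set size.
    double twcs_wa = std::log2(window_size / memory_size);

    std::printf("STCS-like write amplification ~ %.1f\n", stcs_wa); // ~10
    std::printf("TWCS write amplification      ~ %.1f\n", twcs_wa); // ~3.3
    return 0;
}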

We define Off-Strategy Compactions as a compaction happening at a state where the invariant is broken. The invariant can be broken for the following reasons:

  • Resharding
  • Bootstrap
  • Repair
  • Importing sstables (“nodetool refresh”)
  • Upgrade SSTables (“nodetool upgradesstables”)
  • Altering the compaction strategy
  • User load that does not fit the compaction strategy (for example, non-time-series data with time windowed compaction strategy)

While the last case is a user error, the other cases happen during normal or semi-normal operation and will result in degraded performance, without any ability on the user’s part to mitigate the effects.

Types of off-strategy compactions

  • Resharding - Resharding happens when a node is restarted with a different number of active processors (shards), or if some other partitioner parameter has changed. Since partitioning to shards is static, data must be moved to the new shards; usually requiring a full read and write of the entire data set. While resharding is happening, all reads have to consider all sstables, not just the shard’s sstables, resulting in O(number of shards) additional read amplification. While the bloom filter will reduce this for single-partition reads, range scans will have to read all of those sstables, resulting in lower throughput and lower concurrency (due to increased memory consumption).
  • Bootstrap/repair - The bootstrap and repair processes are vnode-oriented, so after they complete, the node will have 256 additional sstables. Those new sstables participate in reads, causing read amplification and increased memory consumption.
  • Changing compaction strategy - When the user changes compaction strategy, the existing sstables do not conform to the new compaction strategy’s invariants, and therefore the new compaction strategy may perform poorly.
  • Importing sstables - In the general case importing sstables is similar to resharding and changing the compaction strategy. The imported sstables are not properly sharded, and may have been generated using a different compaction strategy (or no compaction strategy at all).

Proposed Solution:

We will enhance the usage of the staging directory (currently used by the Materialized Views code) so that all of these cases place their sstables in the staging directory.
The benefit of using the staging directory is that we can serve reads from these sstables (using a separate sstable_set) and also major-compact them before moving them to be compacted with the other sstables.
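
A minimal conceptual sketch of the idea, under assumed names (this is not Scylla's actual API): the table keeps its main sstable set plus a staging set, reads consult both, and off-strategy compaction reshapes the staging set before handing it over to the main set.

#include <functional>
#include <vector>

struct sstable { long generation; long size_bytes; };

struct table_sstables {
    std::vector<sstable> main_set;     // maintained by the compaction strategy; invariant holds
    std::vector<sstable> staging_set;  // resharded/streamed/repaired/imported; invariant not yet restored

    // Reads must consider both sets until the staging sstables are integrated.
    std::vector<sstable> sstables_for_read() const {
        std::vector<sstable> all = main_set;
        all.insert(all.end(), staging_set.begin(), staging_set.end());
        return all;
    }

    // Off-strategy compaction: reshape the staging set in isolation (e.g. a major
    // compaction, running in the maintenance scheduling group), then move the
    // result into the main set for the normal strategy to own.
    void integrate_staging(const std::function<std::vector<sstable>(std::vector<sstable>)>& reshape) {
        std::vector<sstable> reshaped = reshape(std::move(staging_set));
        staging_set.clear();
        main_set.insert(main_set.end(), reshaped.begin(), reshaped.end());
    }
};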

@slivne
Contributor Author

slivne commented Nov 14, 2019

@bhalevy queue as part of the compaction work ...

@asias
Contributor

asias commented Feb 10, 2020

Please reference this issue for all the compaction related read/write workload performance drop during node operations, e.g., bootstrap, decommission, repair.

asias added a commit to asias/scylla that referenced this issue Feb 21, 2020
table::add_sstable_and_update_cache is used by streaming and repair to
add sstables generated during the operation.

When we run a node maintenance operation, e.g., bootstrap or repair, a set
of new sstables is created. For a given token range for a given
table, typically a vnode token range, streaming creates 1 new
sstable per shard, while repair creates 1 new sstable per shard on
repair followers and RF - 1 new sstables per shard on the repair master.

Currently, we add those new sstables to the main sstable set directly
and let normal compaction compact and "reshape" those sstables.
As a result, we effectively push the maintenance work to the compaction
scheduling group instead of the maintenance scheduling group, which
impacts user read performance. In addition, sstables that lack
the invariant properties defined by the compaction strategy may cause
bad read performance.

Until Off-Strategy Compaction is available, which will take a while, we
can mitigate the problem by not triggering a compaction
every time we add a stream/repair generated sstable.
Refs: scylladb#5226 scylladb#5199 scylladb#5109
asias added a commit to asias/scylla that referenced this issue Feb 21, 2020
table::add_sstable_and_update_cache is used by streaming and repair to
add sstables generated during the operation.

When we run a node maintenance operation, e.g., bootstrap or repair, a set
of new sstables is created. For a given token range for a given
table, typically a vnode token range, streaming creates 1 new
sstable per shard, while repair creates 1 new sstable per shard on
repair followers and RF - 1 new sstables per shard on the repair master.

Currently, we add those new sstables to the main sstable set directly
and let normal compaction compact and "reshape" those sstables.
As a result, we effectively push the maintenance work to the compaction
scheduling group instead of the maintenance scheduling group, which
impacts user read performance. In addition, sstables that lack
the invariant properties defined by the compaction strategy may cause
bad read performance.

Until Off-Strategy Compaction is available, which will take a while, we
can mitigate the problem by not triggering a compaction
every time we add a stream/repair generated sstable.


Refs: scylladb#5495 scylladb#5226 scylladb#5199 scylladb#5109 scylladb#4884
@raphaelsc
Member

raphaelsc commented Feb 21, 2020

Not sure if major compaction is the panacea for integration, although it improves
the current state.

Consider this case for STCS. Let's say we have 1TB data in active sstable list.

The active sstable list is composed of log2(1TB / ~1G) size tiers (with a min
threshold of 2), where the tiers are laid out as follows:

[tier 0]: 1 sst of 8G
[tier 1]: 1 sst of 32G
[tier 2]: 1 sst of 64G
[tier 3]: 1 sst of 128G
[tier 4]: 1 sst of 256G
[tier 5]: 1 sst of 512G

Now let's say we load 2TB of data spread across many sstables.

If we major compacted them, we'd end up with 1 sstable of 2TB, which if integrated
into the active sstable list, could be left uncompacted for a long time, because
depending on data set size, we couldn't even produce another sstable of a similar size
for the former 2TB sstable to be compacted with.

What if we run STCS compactions on this separate sstable list until the compaction
strategy is satisfied?

So the size tiers for this separate sstable list of 2TB would look as follows when
the strategy is satisfied:

[tier 0]: 1 sst of 16G
[tier 1]: 1 sst of 64G
[tier 2]: 1 sst of 128G
[tier 3]: 1 sst of 256G
[tier 4]: 1 sst of 512G
[tier 5]: 1 sst of 1024G

When we merge the separate sstable list into the active (main) sstable list, it will
look as follows:

[tier 0]: 1 sst of 8G
[tier 1]: 1 sst of 16G
[tier 2]: 1 sst of 32G
[tier 3]: 2 sst of 64G
[tier 4]: 2 sst of 128G
[tier 5]: 2 sst of 256G
[tier 6]: 2 sst of 512G
[tier 7]: 1 sst of 1024G

Then the compaction strategy will kick in again, working towards reducing space and
read amplification. If we had integrated a single 2TB sstable instead, we could
end up with high read and space amplification for an indefinite period of time, until
we run a major compaction, for example.

However, it seems to me that, as an optimization, we could run major compaction
on the separate sstable list if its total data size is smaller than the active
sstable list's, in order to decrease write amplification.

We'll need to properly implement TWCS major compaction for this to succeed, given
that right now it breaks the time-window invariant by merging all sstables into one.
LCS major compaction already merges all input data into N non-overlapping,
fixed-size sstables at the last level.
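
To make the merge above easy to reproduce, here is a toy simulation of the tier layouts (illustrative only; real STCS bucketing also applies bucket_low/bucket_high and min/max thresholds, which are ignored here). The tier index is simply floor(log2(size in GB)), so sstables of similar size land in the same tier:

#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

static int tier_of(double size_gb) { return (int)std::floor(std::log2(size_gb)); }

static void print_tiers(const char* name, const std::vector<double>& sizes_gb) {
    std::map<int, int> tiers; // tier index -> number of sstables in that tier
    for (double s : sizes_gb) tiers[tier_of(s)]++;
    std::printf("%s:\n", name);
    for (const auto& [tier, count] : tiers)
        std::printf("  [~%dG tier]: %d sstable(s)\n", 1 << tier, count);
}

int main() {
    std::vector<double> active = {8, 32, 64, 128, 256, 512};    // ~1TB main set
    std::vector<double> staged = {16, 64, 128, 256, 512, 1024}; // ~2TB staging set after reshaping
    std::vector<double> merged = active;
    merged.insert(merged.end(), staged.begin(), staged.end());

    print_tiers("active set", active);
    print_tiers("reshaped staging set", staged);
    print_tiers("merged set", merged); // tiers that now hold 2 sstables become compaction candidates
    return 0;
}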

@glommer
Contributor

glommer commented Feb 21, 2020

We need to do cross-tier compactions anyway; otherwise tombstones don't go away.

And you're right that TWCS will have to have a specialized version of major compaction.

@raphaelsc
Member

raphaelsc commented Feb 21, 2020

We need to think this cross-tier compaction idea through, so as not to increase STCS write amplification when it matters. For example, there's probably no purpose in doing it when the workload is write-only. Also, we shouldn't do it when we're under a heavy write workload, unless we're facing a disk pressure scenario and we determine that doing cross-tier compaction will save space by getting rid of obsolete data, or perhaps we want to reduce read amplification. I wouldn't like us to face the same problem as when we lowered the min threshold to 2 and increased STCS write amplification by a factor of ~2, affecting write performance. So the biggest challenge is determining an appropriate time to trigger cross-tier compaction. The disk space controller will be handy here!

@avikivity
Member

We need to do cross-tier compactions anyway; otherwise tombstones don't go away.

And you're right that TWCS will have to have a specialized version of major compaction.

That's easy: override get_major_compaction_job() (and use segregate_by_timestamp() from @denesb); it's only some gluing away.
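
A rough sketch of that shape, with hypothetical types (the real change would go through get_major_compaction_job() / segregate_by_timestamp() inside Scylla): a TWCS-aware major compaction groups its input by time window and emits one output per window, rather than merging everything into a single sstable.

#include <cstdint>
#include <map>
#include <vector>

struct fragment { int64_t write_timestamp_us; /* key, value, ... */ };

// One output bucket per time window, so the "no sstable crosses a window"
// invariant survives a major compaction as well.
std::map<int64_t, std::vector<fragment>>
segregate_by_window(const std::vector<fragment>& input, int64_t window_size_us) {
    std::map<int64_t, std::vector<fragment>> per_window;
    for (const auto& f : input) {
        int64_t window = f.write_timestamp_us / window_size_us; // window this data belongs to
        per_window[window].push_back(f);
    }
    return per_window; // each bucket is written out as one sstable
}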

@avikivity
Member

Not sure if major compaction is the panacea for integration, although it improves
the current state.

Consider this case for STCS. Let's say we have 1TB data in active sstable list.

The active sstable list is composed of log2(1TB / ~1G) size tiers (with a min
threshold of 2), where the tiers are laid out as follows:

[tier 0]: 1 sst of 8G
[tier 1]: 1 sst of 32G
[tier 2]: 1 sst of 64G
[tier 3]: 1 sst of 128G
[tier 4]: 1 sst of 256G
[tier 5]: 1 sst of 512G

Now let's say we load 2TB of data spread across many sstables.

If we major compacted them, we'd end up with 1 sstable of 2TB, which if integrated
into the active sstable list, could be left uncompacted for a long time, because
depending on data set size, we couldn't even produce another sstable of a similar size
for the former 2TB sstable to be compacted with.

What if we run STCS compactions on this separate sstable list until the compaction
strategy is satisfied?

So the size tiers for this separate sstable list of 2TB would look as follows when
the strategy is satisfied:

[tier 0]: 1 sst of 16G
[tier 1]: 1 sst of 64G
[tier 2]: 1 sst of 128G
[tier 3]: 1 sst of 256G
[tier 4]: 1 sst of 512G
[tier 5]: 1 sst of 1024G

When we merge the separate sstable list into the active (main) sstable list, it will
look as follows:

[tier 0]: 1 sst of 8G
[tier 1]: 1 sst of 16G
[tier 2]: 1 sst of 32G
[tier 3]: 2 sst of 64G
[tier 4]: 2 sst of 128G
[tier 5]: 2 sst of 256G
[tier 6]: 2 sst of 512G
[tier 7]: 1 sst of 1024G

Then the compaction strategy will kick in again, working towards reducing space and
read amplification. If we had integrated a single 2TB sstable instead, we could
end up with high read and space amplification for an indefinite period of time, until
we run a major compaction, for example.

I don't think it's a big problem. You can arrive at a similar situation by normally writing until you get a 2TB sstable, and then doing overwrites and deletions for the lower tiers until their sstables shrink. So yes, you have a large sstable, but it's not harming anyone.

We need to compact it from time to time to get rid of tombstones, but that's independent of import/streaming scenarios. Perhaps tag any sstable that hasn't been touched for some time for low-priority compaction.

asias added a commit to asias/scylla that referenced this issue Feb 25, 2020
table::add_sstable_and_update_cache is used by streaming and repair to
add sstables generated during the operation.

When we run a node maintenance operation, e.g., bootstrap or repair, a set
of new sstables is created. For a given token range for a given
table, typically a vnode token range, streaming creates 1 new
sstable per shard, while repair creates 1 new sstable per shard on
repair followers and RF - 1 new sstables per shard on the repair master.

Currently, we add those new sstables to the main sstable set directly
and let normal compaction compact and "reshape" those sstables.
As a result, we effectively push the maintenance work to the compaction
scheduling group instead of the maintenance scheduling group, which
impacts user read performance. In addition, sstables that lack
the invariant properties defined by the compaction strategy may cause
bad read performance.

Until Off-Strategy Compaction is available, which will take a while, we
can mitigate the problem by not triggering a compaction
every time we add a stream generated sstable. Instead, we trigger a
compaction when all the streaming is done.

Refs: scylladb#5495 scylladb#5226 scylladb#5199 scylladb#5109 scylladb#4884
@raphaelsc
Member

BTW, I think off-strategy compaction will be easier with STCS in terms of compaction work (write amplification) after we integrate the major-compacted sstable(s) into the table's main sstable set. That's because of how the strategies work.

STCS major compaction generates a single sstable that will be placed at the biggest size tier.

TWCS major compaction will generate one sstable for each time window, which means that compaction will subsequently have to compact two sstables for each time window involved. I don't know how this could be done better, but it may result in up to all existing time windows being recompacted.

LCS major compaction generates a new level L, composed of N sstables (a run); with nodetool compact, L is usually the table's highest level. IIUC we're left with 2 options: 1) have off-strategy compaction generate a new level L equal to the table's highest level, meaning the two will have to be merged when integrated because of the overlapping invariant. That's not good, because it basically means rewriting the entire table, given that the highest level contains about 90% of the data set. 2) place the level (run) generated by major compaction at a level which is empty and higher than the table's highest level. This looks like the most viable approach; if the new level is considered unnecessary later on, LCS will eventually start demoting sstables from this new level in a non-aggressive way.
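
A small sketch of option 2 under assumed, simplified structures (not Scylla code): place the run produced by off-strategy major compaction at an empty level above the table's current highest level, so integration never forces a rewrite of the existing highest level.

#include <utility>
#include <vector>

struct run { /* a sorted sequence of non-overlapping sstables */ };

struct lcs_levels {
    std::vector<std::vector<run>> levels; // levels[0] = L0, levels[1] = L1, ...

    int highest_non_empty_level() const {
        for (int l = (int)levels.size() - 1; l >= 0; --l)
            if (!levels[l].empty()) return l;
        return 0;
    }

    // Option 2: integrate the off-strategy run at (highest level + 1), which is
    // guaranteed empty, instead of merging it into the existing highest level.
    void integrate_off_strategy_run(run r) {
        int target = highest_non_empty_level() + 1;
        if ((int)levels.size() <= target) levels.resize(target + 1);
        levels[target].push_back(std::move(r));
    }
};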

asias added a commit to asias/scylla that referenced this issue Mar 18, 2020
Compaction dominating IO bandwidth makes streaming slow during
bootstrap. To mitigate the problem before we have off-strategy compaction
support, we want to reduce the compaction work during bootstrap. The
patch 'compaction: use a larger min_threshold during bootstrap, replace'
reduces compaction work by increasing the min threshold of sstables for
compaction.

In this patch, we reduce the number of sstables generated for bootstrap
by increasing the ranges per stream plan. The bootstrapping node does not
serve reads, so it does not have the cache invalidation problem that
decommission has.

Refs: scylladb#5226 scylladb#5199 scylladb#5109 scylladb#4884
@raphaelsc
Member

Resharding analysis for off-strategy compaction
@avikivity @glommer

Right now, resharding may produce N * smp::count sstables, where N is the number of input shared sstables. That's because resharding may decide to put each shared sstable into a separate resharding job, resulting in each input sstable being resharded individually.

To produce only 1 sstable for each shard, as envisioned by off-strategy compaction, resharding as-is would have to run on a single shard with all input sstables. But we need to take advantage of parallelism.

If we split all resharding work among all shards, with each shard running only 1 resharding job (as a sort of major compaction), smp::count sstables would be produced for each shard. Is that acceptable? If not, after resharding completes, we could run a "major compaction" on the smp::count sstables produced by resharding for a given shard.
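
For a quick sense of the numbers (assumed values, purely illustrative):

#include <cstdio>

int main() {
    int shards = 8;           // smp::count (assumed)
    int shared_sstables = 64; // N input sstables that span shards (assumed)

    // Today: each shared sstable may get its own resharding job, producing
    // up to one output sstable per shard per input sstable.
    int today_total = shared_sstables * shards; // 512

    // Proposed: one resharding job per shard over all inputs produces
    // smp::count sstables per shard, optionally major-compacted down to 1.
    int proposed_per_shard = shards; // 8 per shard, 64 total
    int proposed_after_major = 1;    // 1 per shard, 8 total

    std::printf("current approach: up to %d sstables\n", today_total);
    std::printf("proposed: %d per shard (%d total), or %d per shard (%d total) after a per-shard major compaction\n",
                proposed_per_shard, proposed_per_shard * shards,
                proposed_after_major, proposed_after_major * shards);
    return 0;
}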

avikivity added a commit that referenced this issue Apr 28, 2021
…aphael

"
From now on, offstrategy compaction is triggered on completion of repair-based
removenode. So compaction will no longer act aggressively while removenode
is going on, which helps reduce both latency and operation time.

Refs #5226.
"

* 'offstrategy_removenode' of github.com:raphaelsc/scylla:
  repair: Wire offstrategy compaction to repair-based removenode
  table: introduce trigger_offstrategy_compaction()
  repair/row_level: make operations_supported static const
asias added a commit to asias/scylla that referenced this issue May 7, 2021
When decommission is done, all nodes that receive data from the
decommissioning node will run the node_ops_cmd::decommission_done handler.

Trigger off-strategy compaction inside the handler to wire off-strategy
for decommission.

Refs scylladb#5226
avikivity pushed a commit that referenced this issue May 10, 2021
When decommission is done, all nodes that receive data from the
decommissioning node will run the node_ops_cmd::decommission_done handler.

Trigger off-strategy compaction inside the handler to wire off-strategy
for decommission.

Refs #5226

Closes #8607
avikivity pushed a commit that referenced this issue May 11, 2021
When decommission is done, all nodes that receive data from the
decommissioning node will run the node_ops_cmd::decommission_done handler.

Trigger off-strategy compaction inside the handler to wire off-strategy
for decommission.

Refs #5226

Closes #8607
@bhalevy
Member

bhalevy commented May 27, 2021

After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing compaction strategy.
Is that right @raphaelsc ?

@raphaelsc
Member

After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing compaction strategy.
Is that right @raphaelsc ?

you're absolutely correct, sir.

@bhalevy
Member

bhalevy commented Aug 3, 2022

@slivne with the remaining work here I think we can demote to Eng-3 label, or maybe remove the level entirely.

@slivne slivne added Eng-3 and removed Eng-2 labels Aug 3, 2022
@bhalevy bhalevy unassigned cmm Sep 7, 2022
@DoronArazii DoronArazii modified the milestones: 5.1, 5.2 Oct 12, 2022
@DoronArazii DoronArazii modified the milestones: 5.2, 5.x Nov 22, 2022
@mykaul
Contributor

mykaul commented Feb 14, 2023

After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing compaction strategy.
Is that right @raphaelsc ?

you're absolutely correct, sir.

@bhalevy , @raphaelsc - is the above tracked somewhere?

@bhalevy
Member

bhalevy commented Feb 14, 2023

After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing compaction strategy.
Is that right @raphaelsc ?

you're absolutely correct, sir.

@bhalevy , @raphaelsc - is the above tracked somewhere?

No, AFAIK. We can open a separate issue for it if needed.
