Off-Strategy Compaction #5226
Comments
@bhalevy queue as part of the compaction work ...
Please reference this issue for all compaction-related read/write workload performance drops during node operations, e.g., bootstrap, decommission, repair.
The table::add_sstable_and_update_cache is used by streaming and repair to add sstables generated during the operation. When we run a node maintenance operation, e.g., bootstrap or repair, a set of new sstables is created. For a given token range of a given table, typically a vnode token range, streaming creates one new sstable per shard, while repair creates one new sstable per shard on repair followers and RF - 1 new sstables per shard on the repair master. Currently, we add those new sstables to the main sstable set directly and let normal compaction compact and "reshape" them. As a result, we effectively push the maintenance work to the compaction scheduling group instead of the maintenance scheduling group, impacting user read performance. In addition, sstables that lack the invariant properties defined by the compaction strategy may cause bad read performance. Until we have Off-Strategy Compaction, which will take a while to arrive, we can mitigate the problem by not triggering a compaction every time we add a stream/repair generated sstable. Refs: scylladb#5226 scylladb#5199 scylladb#5109
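A minimal standalone sketch of the mitigation described in this commit message. The names here (table_sketch, add_maintenance_sstable, on_maintenance_done) are invented for illustration and are not Scylla's actual API; the only point is that per-sstable compaction triggers are replaced by a single trigger when the operation completes.

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct sstable {
    std::string name;
};

// Invented class/method names -- not Scylla's actual API.
class table_sketch {
    std::vector<sstable> _main_set;  // sstables visible to normal compaction
    std::vector<sstable> _pending;   // maintenance-generated sstables
public:
    // Called for every sstable produced by streaming/repair.
    // Note: no compaction trigger here (the mitigation).
    void add_maintenance_sstable(sstable s) {
        _pending.push_back(std::move(s));
    }

    // Called once when the whole operation (bootstrap/repair) finishes:
    // integrate everything, then trigger a single compaction pass.
    void on_maintenance_done() {
        _main_set.insert(_main_set.end(), _pending.begin(), _pending.end());
        _pending.clear();
        trigger_compaction();
    }

private:
    void trigger_compaction() {
        std::cout << "compacting " << _main_set.size() << " sstables\n";
    }
};

int main() {
    table_sketch t;
    for (int i = 0; i < 4; ++i) {
        t.add_maintenance_sstable({"stream-" + std::to_string(i)});
    }
    t.on_maintenance_done();  // one compaction trigger instead of four
}
```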
Not sure if major compaction is the panacea for integration, although it improves the situation. Consider this case for STCS. Let's say we have 1TB of data in the active sstable list, composed of several size tiers.

Now let's say we load 2TB of data spread across many sstables. If we major compacted them, we'd end up with 1 sstable of 2TB which, if integrated, would sit alone as a huge sstable in the largest tier. What if we instead run STCS compactions on this separate sstable list until the compaction strategy is satisfied? The size tiers for this separate 2TB sstable list would then settle into a handful of tiers that respect the strategy's invariant.

When we merge the separate sstable list into the active (main) sstable list, it will already look like something the strategy itself could have produced.

Then the compaction strategy will kick off again, working towards reducing space and read amplification. However, it seems to me that, as an optimization, we could run major compaction on the separate list in some cases. We'll need to properly implement TWCS major compaction for this to succeed, given that TWCS must produce one sstable per time window rather than a single output.
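To make the size-tier argument concrete, here is a small self-contained simulation. The numbers are assumptions for illustration (2TB loaded as 250 sstables of 8GB each, min_threshold of 4), and the power-of-two bucketing is a simplification of STCS's real bucket_low/bucket_high logic.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Tier key: floor(log2(size in GB)) -- a simplification of STCS's real
// bucketing (which uses bucket_low/bucket_high and min/max thresholds).
static int tier_of(int64_t size_gb) {
    int t = 0;
    while ((int64_t{1} << (t + 1)) <= size_gb) ++t;
    return t;
}

int main() {
    const size_t min_threshold = 4;
    std::vector<int64_t> sstables(250, 8);  // ~2TB loaded as 250 x 8GB

    // Run STCS-style compactions until no tier holds >= min_threshold
    // sstables, i.e. until the strategy would be satisfied.
    for (bool changed = true; changed;) {
        changed = false;
        std::map<int, std::vector<int64_t>> tiers;
        for (auto s : sstables) {
            tiers[tier_of(s)].push_back(s);
        }
        for (auto& [t, members] : tiers) {
            if (members.size() < min_threshold) {
                continue;
            }
            // Merge min_threshold sstables of this tier into one.
            int64_t merged = 0;
            for (size_t i = 0; i < min_threshold; ++i) {
                merged += members.back();
                members.pop_back();
            }
            // Rebuild the flat list, including the merged sstable.
            sstables.clear();
            for (auto& [t2, ms] : tiers) {
                sstables.insert(sstables.end(), ms.begin(), ms.end());
            }
            sstables.push_back(merged);
            changed = true;
            break;  // re-bucket from scratch
        }
    }
    for (auto s : sstables) {
        std::cout << s << "GB ";  // the settled size tiers
    }
    std::cout << "\n";
}
```

Under these assumptions, the separate list settles into a few tiers (3x512GB, 3x128GB, 2x32GB, 2x8GB) rather than a single 2TB sstable, which is a shape that can be merged into the main list without immediately violating the STCS invariant.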
We need to do cross-tier compactions anyway, otherwise tombstones don't go away. And you're right that TWCS will have to have a specialized version of major compaction.
We need to think this cross-tier compaction idea through, so as not to increase STCS write amplification when it matters. For example, there's probably no point in doing it when the workload is write-only. Also, we shouldn't do it when we're under heavy write workload, unless we're facing a disk pressure scenario and we determine that cross-tier compaction will save space by getting rid of obsolete data, or perhaps when we want to reduce read amplification. I wouldn't like us to face the same problem as when we lowered min threshold to 2 and increased STCS write amplification by a factor of ~2, affecting write performance. So the biggest challenge is determining an appropriate time to trigger cross-tier compaction. The disk space controller will be handy here!
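As a strawman for "determining an appropriate time", here is a hedged sketch of a trigger predicate. Every field name and threshold below is invented for illustration; this is not Scylla's controller logic, only a sketch of the trade-off described in the comment above.

```cpp
#include <iostream>

// Illustrative-only inputs; none of these names or thresholds come
// from Scylla -- they are assumptions for the sake of the sketch.
struct tier_stats {
    int    sstable_count;      // sstables across all size tiers
    double disk_utilization;   // 0.0 .. 1.0
    double estimated_garbage;  // fraction of expired/overwritten data
    double write_load;         // 0.0 .. 1.0, recent write intensity
};

// True when a cross-tier compaction is likely to pay for itself.
bool should_cross_tier_compact(const tier_stats& s) {
    // Under heavy writes, avoid the extra write amplification...
    if (s.write_load > 0.8) {
        // ...unless disk pressure makes reclaiming obsolete data urgent.
        return s.disk_utilization > 0.9 && s.estimated_garbage > 0.2;
    }
    // Otherwise compact to cut read amplification or purge tombstones.
    return s.sstable_count > 8 || s.estimated_garbage > 0.3;
}

int main() {
    tier_stats s{12, 0.5, 0.1, 0.2};  // write-light, many sstables
    std::cout << std::boolalpha << should_cross_tier_compact(s) << "\n";
}
```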
That's easy, override get_major_compaction_job() (and use
I don't think it's a big problem. You can arrive at a similar situation by normally writing until you get a 2TB sstable, and then do overwrites and deletions in the lower tiers until their sstables shrink. So yes, you have a large sstable, but it's not harming anyone. We need to compact it from time to time to get rid of tombstones, but that's independent of import/streaming scenarios. Perhaps tag any sstable that hasn't been touched for some time for low-priority compaction.
The table::add_sstable_and_update_cache is used by streaming and repair to add sstables generated during the operation. When we run a node maintenance operation, e.g., bootstrap or repair, a set of new sstables is created. For a given token range of a given table, typically a vnode token range, streaming creates one new sstable per shard, while repair creates one new sstable per shard on repair followers and RF - 1 new sstables per shard on the repair master. Currently, we add those new sstables to the main sstable set directly and let normal compaction compact and "reshape" them. As a result, we effectively push the maintenance work to the compaction scheduling group instead of the maintenance scheduling group, impacting user read performance. In addition, sstables that lack the invariant properties defined by the compaction strategy may cause bad read performance. Until we have Off-Strategy Compaction, which will take a while to arrive, we can mitigate the problem by not triggering a compaction every time we add a stream-generated sstable. Instead, we trigger a compaction when all the streaming is done. Refs: scylladb#5495 scylladb#5226 scylladb#5199 scylladb#5109 scylladb#4884
BTW, I think off-strategy compaction will be easier with STCS in terms of compaction work (write amplification) after we integrate the major compacted sstable(s) into the table's main sstable set. That's because of how the strategies work. STCS major compaction generates a single sstable that will be placed at the biggest size tier.

TWCS major compaction will generate one sstable for each time window, which means that compaction will subsequently have to compact two sstables for each time window involved. I don't know how this could be done better, but it may result in up to all existing time windows being recompacted.

And LCS major compaction generates a new level L, composed of N sstables (a run); with nodetool compact, L is usually the table's highest level. We're left with 2 options IIUC: 1) have off-strategy compaction generate a new level L equal to the table's highest level, meaning the two will have to be merged when integrated because of the overlapping invariant. That's not good, because it basically means rewriting the entire table, given that the highest level contains about 90% of the data set. 2) place the level (run) generated by major compaction at a level which is empty and higher than the table's highest level; see the sketch below. This looks like the most viable approach. If the new level is considered unnecessary later on, LCS will eventually start demoting sstables from this new level in a non-aggressive way.
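A minimal sketch of option (2). The names (lcs_state, off_strategy_target_level) are invented for illustration and are not Scylla's interfaces.

```cpp
#include <iostream>

struct lcs_state {
    int highest_level;  // highest non-empty level in the table
};

// Option (2): place the run produced by off-strategy major compaction
// at an empty level just above the table's highest level. Integration
// then cannot violate the "no overlapping sstables within a level"
// invariant, and avoids rewriting the ~90% of data in the top level.
int off_strategy_target_level(const lcs_state& s) {
    return s.highest_level + 1;
}

int main() {
    lcs_state s{3};
    std::cout << "integrate off-strategy run at L"
              << off_strategy_target_level(s) << "\n";
}
```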
Compaction dominating IO bandwidth makes streaming slow during bootstrap. To mitigate the problem before we have off-strategy compaction support, we want to reduce the compaction work during bootstrap. The patch 'compaction: use a larger min_threshold during bootstrap, replace' reduces compaction by increasing the minimum number of sstables required to trigger a compaction. In this patch, we reduce the number of sstables generated during bootstrap by increasing the number of ranges per stream plan. The bootstrapping node does not serve reads, so it does not have the cache invalidation problem that decommission has. Refs: scylladb#5226 scylladb#5199 scylladb#5109 scylladb#4884
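A worked example of why more ranges per plan means fewer sstables, assuming (per the earlier commit message) roughly one sstable per shard per stream plan; all the numbers below are illustrative, not measured.

```cpp
#include <iostream>

int main() {
    const int vnode_ranges = 256;  // ranges the bootstrapping node pulls
    const int shards = 8;          // shards on the node

    // Assuming ~one sstable per shard per stream plan, fewer plans
    // (more ranges per plan) means fewer sstables to reshape later:
    for (int ranges_per_plan : {1, 16, 64}) {
        int plans = (vnode_ranges + ranges_per_plan - 1) / ranges_per_plan;
        std::cout << ranges_per_plan << " range(s)/plan -> " << plans
                  << " plans -> ~" << plans * shards << " sstables\n";
    }
}
```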
Resharding analysis for off-strategy compaction

Right now, resharding may produce N * smp::count sstables, where N is the number of input shared sstables. That's because resharding may decide to put each shared sstable into a separate resharding job, resulting in each input sstable being resharded individually. To produce only 1 sstable for each shard, as envisioned by off-strategy compaction, resharding as-is would have to run on a single shard with all input sstables. But we need to take advantage of parallelism. If we split all the resharding work among all shards, with each shard running only 1 resharding job (as a sort of major compaction), smp::count sstables would be produced for each shard. Is that acceptable? If not, after resharding completes, we could run a "major compaction" on the smp::count sstables produced by resharding for a given shard.
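The output counts in the analysis above, made concrete. The numbers (N = 6, smp_count = 8) are assumed for illustration; smp_count stands in for seastar's smp::count.

```cpp
#include <iostream>

int main() {
    const int n_input = 6;   // N shared input sstables
    const int smp_count = 8; // shards on the node (stands in for smp::count)

    // One resharding job per input sstable: each job splits its input
    // into one output sstable per shard.
    std::cout << "per-sstable jobs:      " << n_input * smp_count
              << " output sstables total\n";

    // One resharding job per shard, over a partition of the inputs:
    // each of the smp_count jobs still emits one sstable per owning
    // shard, so every shard ends up with smp_count sstables.
    std::cout << "one job per shard:     " << smp_count * smp_count
              << " output sstables total (" << smp_count << " per shard)\n";

    // A follow-up per-shard "major compaction" collapses each shard's
    // smp_count sstables into a single one.
    std::cout << "after per-shard major: " << smp_count
              << " output sstables total (1 per shard)\n";
}
```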
…aphael " From now on, offstrategy compaction is triggered on completion of repair-based removenode. So compaction will no longer act aggressively while removenode is going on, which helps reducing both latency and operation time. Refs #5226. " * 'offstrategy_removenode' of github.com:raphaelsc/scylla: repair: Wire offstrategy compaction to repair-based removenode table: introduce trigger_offstrategy_compaction() repair/row_level: make operations_supported static const
When decommission is done, all nodes that received data from the decommissioned node will run the node_ops_cmd::decommission_done handler. Trigger off-strategy compaction inside the handler to wire off-strategy compaction for decommission. Refs scylladb#5226
After #8677 it looks like the only thing left to close this issue is to wire off-strategy compaction after changing the compaction strategy.
you're absolutely correct, sir.
@slivne with the remaining work here I think we can demote to |
@bhalevy, @raphaelsc - is the above tracked somewhere?
No, AFAIK. We can open a separate issue for it if needed.
From: @avikivity
A compaction strategy attempts to serve user workloads with certain characteristics by maintaining certain invariants that guarantee some performance level or other metric. For example, Time Window Compaction Strategy is tuned for time series workloads: it maintains the invariant that no sstables cross time windows, and guarantees that the write amplification is O(log(window size / memory size)) rather than size-tiered's O(log(data size / memory size)). The general-purpose size-tiered compaction strategy maintains the invariant that the number of sstables of similar size is low, so the compaction-strategy components of both read amplification and write amplification are O(log(disk size / memory size)).
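To make the asymptotics above concrete, a back-of-the-envelope example; the 4x tier growth factor and all sizes below are assumptions for illustration, not figures from this issue.

```latex
\[
\mathrm{WA}_{\text{STCS}} \approx \log_4\!\left(\frac{\text{data size}}{\text{memtable size}}\right)
\qquad
\mathrm{WA}_{\text{TWCS}} \approx \log_4\!\left(\frac{\text{window size}}{\text{memtable size}}\right)
\]
% With 1 TB of data, 1 GB memtable flushes, and 16 GB time windows:
%   WA_STCS ~ log_4(1024) = 5 rewrites of each piece of data
%   WA_TWCS ~ log_4(16)   = 2 rewrites
```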
We define Off-Strategy Compaction as a compaction happening in a state where the invariant is broken. The invariant can be broken for the following reasons:
While the last case is a user error, the other cases happen during normal or semi-normal operation and will result in degraded performance, without any ability on the user’s part to mitigate the effects.
Types of off-strategy compactions
Proposed Solution:
We will enhance the usage of the staging directory (currently used by the Materialized Views code), so that all these cases place their sstables in the staging directory.
The benefit of using the staging directory is that we can serve reads from these sstables (using a separate sstable_set) and also major compact them before moving them to be compacted with other sstables.
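A minimal standalone sketch of that idea. The names (table_sets, sstables_for_reads, integrate_staging) are invented for illustration; the real code paths around sstable_set and the maintenance scheduling group are not shown.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct sstable {
    std::string name;
};

// Invented names for illustration -- not Scylla's actual interfaces.
struct table_sets {
    std::vector<sstable> main_set;     // obeys the strategy's invariants
    std::vector<sstable> staging_set;  // streamed/resharded/imported sstables

    // Reads consult the union of both sets, so staged data stays visible.
    std::vector<sstable> sstables_for_reads() const {
        auto all = main_set;
        all.insert(all.end(), staging_set.begin(), staging_set.end());
        return all;
    }

    // Off-strategy compaction: major-compact the staging set (ideally in
    // the maintenance scheduling group), then move the result into the
    // main set in one step.
    void integrate_staging() {
        sstable merged{"major(staging)"};  // stands in for real compaction
        staging_set.clear();
        main_set.push_back(merged);
    }
};

int main() {
    table_sets t;
    t.main_set = {{"a"}, {"b"}};
    t.staging_set = {{"stream-1"}, {"stream-2"}};
    std::cout << "readable sstables: " << t.sstables_for_reads().size() << "\n";
    t.integrate_staging();
    std::cout << "after integration: " << t.sstables_for_reads().size() << "\n";
}
```

The design point being illustrated: the strategy's invariants are judged only against the main set, while staged sstables remain readable the whole time.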