-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Range tombstones should be evicted gradually #3288
Comments
The cache of system.large_{partition,rows,cells} accumulates range tombstones (#7750), and those range tombstones can be evicted only together with their partition (#3288). Making the system.large_* tables uncached should work around the problem until #3288 is fixed. Fixes #8874 Refs #7750 Refs #3288 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>
The cache of system.large_{partition,rows,cells} accumulates range tombstones (scylladb#7750), and those range tombstones can be evicted only together with their partition (scylladb#3288). Making the system.large_* tables uncached should work around the problem until scylladb#3288 is fixed. Fixes scylladb#8874 Refs scylladb#7750 Refs scylladb#3288 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>
Similar problem in Linux: https://lwn.net/Articles/894098/ |
I think this will become almost naturally solved once we implement the unified v2 internal format. AFAIK we established that we want to store range tombstone markers in/next to rows. In that case, when evicting a row with any range tombstone marker on it, we just move the marker to the next row. If no other row in range, or next row already a marker (current range tombstone ends ther), we drop the range tombstone marker too. |
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
This patch changes mutation_partition_v2 to store range tombstone information together with rows. This mainly affects the version merging algorithm, mutation_partition_v2::apply_monotonically(). Continuity setting no longer can drop dummy entry unconditionally since it may be a boundary of a range tombstone. Memtable/cache is not switched yet. Refs scylladb#10587 Refs scylladb#3288
This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes scylladb#2578 Fixes scylladb#3288 Fixes scylladb#10587
…omasz Grabiec This series switches memtable and cache to use a new representation for mutation data, called `mutation_partition_v2`. In this representation, range tombstone information is stored in the same tree as rows, attached to row entries. Each entry has a new tombstone field, which represents range tombstone part which applies to the interval between this entry and the previous one. See docs/dev/mvcc.md for more details about the format. The transient mutation object still uses the old model in order to avoid work needed to adapt old code to the new model. It may also be a good idea to live with two models, since the transient mutation has different requirements and thus different trade-offs can be made. Transient mutation doesn't need to support eviction and strong exception guarantees, so its algorithms and in-memory representation can be simpler. This allows us to incrementally evict range tombstone information. Before this series, range tombstones were accumulated and evicted only when the whole partition entry was evicted. This could lead to inefficient use of cache memory. Another advantage of the new representation is that reads don't have to lookup range tombstone information in a different tree while reading. This leads to simpler and more efficient readers. There are several disadvantages too. Firstly, rows_entry is now larger by 16 bytes. Secondly, update algorithms are more complex because they need to deoverlap range tombstone information. Also, to handle preemption and provide strong exception guarantees, update algorithms may need to allocate sentinel entries, which adds complexity and reduces performance. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Remaining work: - performance optimizations to apply_monotonically() to avoid regressions - performance testing - preemption support in apply_to_incomplete (cache update from memtable) Fixes #2578 Fixes #3288 Fixes #10587 Closes #12048 * github.com:scylladb/scylladb: test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction test: mvcc: Extract mvcc_container::allocate_in_region() row_cache, lru: Introduce evict_shallow() test: mvcc: Avoid copies of mutation under failure injection test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic mutation_partition_v2: Avoid full scan when applying mutation to non-evictable Pass is_evictable to apply() tests: mutation_partition_v2: Introduce test_external_memory_usage_v2 mirroring the test for v1 tests: mutation: Fix test_external_memory_usage() to not measure mutation object footprint tests: mutation_partition_v2: Add test for exception safety of mutation merging tests: Add tests for the mutation_partition_v2 model mutation_partition_v2: Implement compact() cache_tracker: Extract insert(mutation_partition_v2&) mvcc, mutation_partition: Document guarantees in case merging succeeds mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically() mutation_partition_v2: Simplify get_continuity() row_cache: Distinguish dummy insertion site in trace log db: Use mutation_partition_v2 in mvcc range_tombstone_change_merger: Introduce peek() readers: Extract range_tombstone_change_merger mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots mvcc: partition_snapshot_row_cursor: Support digest calculation mutation_partition_v2: Store range tombstones together with rows db: Introduce mutation_partition_v2 doc: Introduce docs/dev/mvcc.md db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU db: Print range_tombstone bounds as position_in_partition test: memtable_test: Relax test_segment_migration_during_flush test: cache_flat_mutation_reader: Avoid timestamp clash test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows test: mvcc: Fix sporadic failures due to compact_for_compaction() test: lib: random_mutation_generator: Produce partition tombstone less often test: lib: random_utils: Introduce with_probability() test: lib: Improve error message in has_same_continuity() test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker test: mvcc: Insert entries in the tracker test: mvcc_test: Do not set dummy::no on non-clustering rows mutation_partition: Print full position in error report in append_clustered_row() db: mutation_cleaner: Extract make_region_space_guard() position_in_partition: Optimize equality check mvcc: Fix version merging state resetting mutation_partition: apply_resume: Mark operator bool() as explicit
I assume we will not be backporting this change anywhere. |
Installation details
Scylla version (or git commit hash): any
With row-level eviction (#2576) we now evict partition rows gradually. We never evict range tombstones though, they accumulate and can go away only together with the whole partition. This is problematic, because:
more is evicted than necessary, which decreases cache efficiency. In the worst case, the whole cache gets evicted at once
evicting large amounts of memory at once may impact latency badly
The text was updated successfully, but these errors were encountered: