Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range tombstones should be evicted gradually #3288

Closed
tgrabiec opened this issue Mar 14, 2018 · 3 comments · Fixed by #12048
Closed

Range tombstones should be evicted gradually #3288

tgrabiec opened this issue Mar 14, 2018 · 3 comments · Fixed by #12048

Comments

@tgrabiec
Copy link
Contributor

Installation details
Scylla version (or git commit hash): any

With row-level eviction (#2576) we now evict partition rows gradually. We never evict range tombstones though, they accumulate and can go away only together with the whole partition. This is problematic, because:

  • more is evicted than necessary, which decreases cache efficiency. In the worst case, the whole cache gets evicted at once

  • evicting large amounts of memory at once may impact latency badly

@tgrabiec tgrabiec self-assigned this Mar 14, 2018
@slivne slivne added this to the 2.x milestone Jun 24, 2018
@slivne slivne added the Eng-2 label Oct 27, 2019
@slivne slivne modified the milestones: 4.x, 4.3 May 25, 2020
@slivne slivne modified the milestones: 4.3, 4.4 Nov 26, 2020
@slivne slivne modified the milestones: 4.4, 4.5 Mar 29, 2021
avikivity pushed a commit that referenced this issue Jun 24, 2021
The cache of system.large_{partition,rows,cells} accumulates range
tombstones (#7750), and those
range tombstones can be evicted only together with their partition
(#3288).

Making the system.large_* tables uncached should work around the
problem until #3288 is fixed.

Fixes #8874
Refs #7750
Refs #3288

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>
lauranovich pushed a commit to lauranovich/scylla that referenced this issue Jul 29, 2021
The cache of system.large_{partition,rows,cells} accumulates range
tombstones (scylladb#7750), and those
range tombstones can be evicted only together with their partition
(scylladb#3288).

Making the system.large_* tables uncached should work around the
problem until scylladb#3288 is fixed.

Fixes scylladb#8874
Refs scylladb#7750
Refs scylladb#3288

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>
@slivne slivne modified the milestones: 4.5, 4.6 Nov 10, 2021
@slivne slivne modified the milestones: 4.6, 5.0 Feb 8, 2022
@avikivity
Copy link
Member

Similar problem in Linux: https://lwn.net/Articles/894098/

@denesb
Copy link
Contributor

denesb commented Aug 17, 2022

I think this will become almost naturally solved once we implement the unified v2 internal format. AFAIK we established that we want to store range tombstone markers in/next to rows. In that case, when evicting a row with any range tombstone marker on it, we just move the marker to the next row. If no other row in range, or next row already a marker (current range tombstone ends ther), we drop the range tombstone marker too.

@DoronArazii DoronArazii modified the milestones: 5.0, 5.x Oct 12, 2022
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Nov 22, 2022
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Nov 22, 2022
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
@enedil enedil assigned zdzblo-trawy and unassigned enedil Dec 19, 2022
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 21, 2022
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 21, 2022
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 22, 2022
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 22, 2022
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 22, 2022
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 22, 2022
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 22, 2022
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 22, 2022
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 22, 2022
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Dec 22, 2022
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 10, 2023
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 10, 2023
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 12, 2023
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 12, 2023
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 16, 2023
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 16, 2023
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 21, 2023
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 21, 2023
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 27, 2023
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.

This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().

Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.

Memtable/cache is not switched yet.

Refs scylladb#10587
Refs scylladb#3288
tgrabiec added a commit to tgrabiec/scylla that referenced this issue Jan 27, 2023
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Range tombstone eviction in cache has now fine granularity, like with
rows.

Fixes scylladb#2578
Fixes scylladb#3288
Fixes scylladb#10587
avikivity added a commit that referenced this issue Feb 5, 2023
…omasz Grabiec

This series switches memtable and cache to use a new representation for mutation data,
called `mutation_partition_v2`. In this representation, range tombstone information is stored
in the same tree as rows, attached to row entries. Each entry has a new tombstone field,
which represents range tombstone part which applies to the interval between this entry and
the previous one. See docs/dev/mvcc.md for more details about the format.

The transient mutation object still uses the old model in order to avoid work needed to adapt
old code to the new model. It may also be a good idea to live with two models, since the
transient mutation has different requirements and thus different trade-offs can be made.
Transient mutation doesn't need to support eviction and strong exception guarantees,
so its algorithms and in-memory representation can be simpler.

This allows us to incrementally evict range tombstone information. Before this series,
range tombstones were accumulated and evicted only when the whole partition entry was evicted. This
could lead to inefficient use of cache memory.

Another advantage of the new representation is that reads don't have to lookup
range tombstone information in a different tree while reading. This leads to simpler
and more efficient readers.

There are several disadvantages too. Firstly, rows_entry is now larger by 16 bytes.
Secondly, update algorithms are more complex because they need to deoverlap range tombstone
information. Also, to handle preemption and provide strong exception guarantees, update
algorithms may need to allocate sentinel entries, which adds complexity and reduces performance.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Remaining work:
  - performance optimizations to apply_monotonically() to avoid regressions
  - performance testing
  - preemption support in apply_to_incomplete (cache update from memtable)

Fixes #2578
Fixes #3288
Fixes #10587

Closes #12048

* github.com:scylladb/scylladb:
  test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction
  test: mvcc: Extract mvcc_container::allocate_in_region()
  row_cache, lru: Introduce evict_shallow()
  test: mvcc: Avoid copies of mutation under failure injection
  test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic
  mutation_partition_v2: Avoid full scan when applying mutation to non-evictable
  Pass is_evictable to apply()
  tests: mutation_partition_v2: Introduce test_external_memory_usage_v2 mirroring the test for v1
  tests: mutation: Fix test_external_memory_usage() to not measure mutation object footprint
  tests: mutation_partition_v2: Add test for exception safety of mutation merging
  tests: Add tests for the mutation_partition_v2 model
  mutation_partition_v2: Implement compact()
  cache_tracker: Extract insert(mutation_partition_v2&)
  mvcc, mutation_partition: Document guarantees in case merging succeeds
  mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically()
  mutation_partition_v2: Simplify get_continuity()
  row_cache: Distinguish dummy insertion site in trace log
  db: Use mutation_partition_v2 in mvcc
  range_tombstone_change_merger: Introduce peek()
  readers: Extract range_tombstone_change_merger
  mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots
  mvcc: partition_snapshot_row_cursor: Support digest calculation
  mutation_partition_v2: Store range tombstones together with rows
  db: Introduce mutation_partition_v2
  doc: Introduce docs/dev/mvcc.md
  db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU
  db: Print range_tombstone bounds as position_in_partition
  test: memtable_test: Relax test_segment_migration_during_flush
  test: cache_flat_mutation_reader: Avoid timestamp clash
  test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows
  test: mvcc: Fix sporadic failures due to compact_for_compaction()
  test: lib: random_mutation_generator: Produce partition tombstone less often
  test: lib: random_utils: Introduce with_probability()
  test: lib: Improve error message in has_same_continuity()
  test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker
  test: mvcc: Insert entries in the tracker
  test: mvcc_test: Do not set dummy::no on non-clustering rows
  mutation_partition: Print full position in error report in append_clustered_row()
  db: mutation_cleaner: Extract make_region_space_guard()
  position_in_partition: Optimize equality check
  mvcc: Fix version merging state resetting
  mutation_partition: apply_resume: Mark operator bool() as explicit
@DoronArazii DoronArazii modified the milestones: 5.x, 5.3 Mar 15, 2023
@mykaul
Copy link
Contributor

mykaul commented Mar 21, 2023

I assume we will not be backporting this change anywhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment