Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging of large partition versions in MVCC is inefficient #2715

Closed
tgrabiec opened this Issue Aug 21, 2017 · 8 comments

Comments

Projects
None yet
3 participants
@tgrabiec
Copy link
Contributor

tgrabiec commented Aug 21, 2017

If readers are short-lived relative to the time it takes to fill a memtable (likely), we will have small versions at the front and a big one at the back. Current merging in partition_snapshot::merge_partition_versions() merges older into newer, which is O(src mutation size). This makes merging very inefficient if we accumulated a large partition. We should merge smaller partition into larger one, while preserving asymmetry of the merging operator.

I observed this causing large latencies in perf_cache_eviction -c1 -m4G after reads start to miss:

Before:

rd/s: 9, wr/s: 71369, ev/s: 0, pmerge/s: 0, miss/s: 7, cache: 472/555 [MB], LSA: 532/627 [MB], std free: 3411 [MB]

reads : min: 124   , 50%: 8239  , 90%: 11864 , 99%: 11864 , 99.9%: 11864 , max: 11864  [us]
writes: min: 3     , 50%: 4     , 90%: 4     , 99%: 6     , 99.9%: 4768  , max: 5722   [us]

After:

rd/s: 3, wr/s: 49663, ev/s: 0, pmerge/s: 0, miss/s: 3, cache: 0/0 [MB], LSA: 338/402 [MB], std free: 70 [MB]

reads : min: 315852, 50%: 315852, 90%: 379022, 99%: 379022, 99.9%: 379022, max: 379022 [us]
writes: min: 3     , 50%: 4     , 90%: 4     , 99%: 5     , 99.9%: 17    , max: 379022 [us]

screenshot from 2017-08-21 16-18-49

@tgrabiec tgrabiec changed the title Merging of large partitions in MVCC is inefficient Merging of large partition versions in MVCC is inefficient Aug 21, 2017

@tgrabiec

This comment has been minimized.

Copy link
Contributor Author

tgrabiec commented Aug 21, 2017

Refs #2379

@tgrabiec tgrabiec added this to the 2.1 milestone Sep 21, 2017

@haaawk

This comment has been minimized.

Copy link
Member

haaawk commented Sep 25, 2017

I'm looking into it.

@tgrabiec tgrabiec assigned slivne and haaawk and unassigned slivne Sep 25, 2017

@haaawk

This comment has been minimized.

Copy link
Member

haaawk commented Sep 25, 2017

I'm having troubles reproducing the issue. Everytime I run the tests, latencies are good.

@tgrabiec

This comment has been minimized.

Copy link
Contributor Author

tgrabiec commented Sep 25, 2017

It's probably because the read always hits, so they end before writes can be applied on top of them, so we only have one version ever. Do you see any miss/s? In my case there was a miss due to cache being evicted.

@haaawk

This comment has been minimized.

Copy link
Member

haaawk commented Sep 25, 2017

I don't see any misses. I do see some pmerge though.

@tgrabiec

This comment has been minimized.

Copy link
Contributor Author

tgrabiec commented Sep 25, 2017

@haaawk It's probably because I've changed cache to be continuous when table is created, so you will never miss unless your cache overflows and triggers eviction. Do you see cache occupancy growing?

@haaawk

This comment has been minimized.

Copy link
Member

haaawk commented Sep 25, 2017

@tgrabiec It can be. I don't see cache growing at all.

@slivne

This comment has been minimized.

Copy link
Contributor

slivne commented Sep 25, 2017

@slivne slivne modified the milestones: 2.1, 2.x Dec 31, 2017

@avikivity avikivity closed this in 60d3c25 Jan 21, 2018

avikivity added a commit that referenced this issue Jan 21, 2018

Merge "Reverse order of version merging in MVCC" from Tomasz
"Changes merging in MVCC to apply newer version to older instead of older to
newer.

Before (v0 = oldest):

  (((v3 + v2) + v1) + v0)

After:

  (v0 + (v1 + (v2 + v3)))

or:

  (((v0 + v1) + v2) + v3)

There are several reasons to do this:

  1) When continuity merging will change semantics to support eviction
     from older versions, it will be easier to implement apply() if we
     can assume that we merge newer to older instead of older to
     newer, since newer version may have entries falling into a
     continuous interval in older, but not the other way around. If we
     didn't revert the order, apply() would have to keep track of
     lower bound of a continuous interval in the right-hand side
     argument (older version) as it is applied and update continuity
     flags in the left hand side by scanning all entries overlapping
     with it. If order is reversed, merging only needs to deal with
     the current entry. Also, if we were to keep the old order, we
     cannot simply move entries from the left hand side as we merge
     because we need to keep track of the lower bound of a continuous
     interval, and we need to provide monotonic exception
     guarantees. So merging would be both more complicated and slower.

  2) With large partitions older versions are typically larger than
     newer versions, and since merging is O(N_right*(1 + log(N_left))),
     it's better to merge newer into older.
     This fixes latency spikes seen in perf_cache_eviction.

Fixes #2715."

* tag 'tgrabiec/reverse-order-of-mvcc-version-merging-v1' of github.com:scylladb/seastar-dev:
  mvcc: Reverse order of version merging
  anchorless_list: Introduce last()
  mvcc: Implement partition_entry::upgrade() using squashed()
  mvcc: Extract version merging functions
  mutation_partition: Add rows_entry::set_dummy()
  position_in_partition: Introduce after_key()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.