Bump compaction priority of an index/range if read/space amplification is high #3225

Open
kostja opened this issue Mar 7, 2018 · 2 comments
Comments

kostja commented Mar 7, 2018

Test scenario (Mail.Ru visapi):

  • create an index which is 35GB in size
  • delete all the data in the index with expirationd
  • expirationd keeps using a lot of CPU while iterating over garbage, and major compaction never happens even though the space is empty.

This happens because we never feed the read amplification cost back into the compaction priority of a range. This is an omission, please fix.

How to fix

Measure read amplification and write amplification. Increase compaction priority, or simply force major compaction, if read amplification > write amplification * level_size_ratio.
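
A minimal sketch of this heuristic in Lua, for illustration only: the per-range counters read_amp, write_amp, level_size_ratio and run_count are hypothetical names, not actual vinyl scheduler internals.

  -- Hypothetical per-range counters; not the real vinyl scheduler API.
  local function compaction_priority(range)
      -- If lookups touch more runs than the write-amplification budget
      -- implied by the level size ratio, schedule a major compaction.
      if range.read_amp > range.write_amp * range.level_size_ratio then
          return math.huge
      end
      -- Otherwise fall back to the usual shape-based priority.
      return range.run_count
  end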

Test scenario (cloud.mail.ru):

Run spacer/moonwalker against the entire data set and add a field to each tuple. Watch space amplification double while compaction doesn't happen: it doesn't happen because the tree shape is largely unchanged by rewriting all tuples.

How to fix

Write amplification in LSM trees is ~15, which means we create ~15GB of temporary files for each 1GB of written data. Currently we garbage collect these files only after a checkpoint. If a file is entirely temporary, i.e. not referenced by any checkpoint, we can collect it without waiting for a checkpoint.
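
A hedged sketch of the idea, assuming hypothetical run metadata (created_lsn, dropped) and the LSN of the last checkpoint; the real vinyl log and garbage-collector bookkeeping is more involved.

  -- A run file is "entirely temporary" if it was created after the last
  -- checkpoint and is already unused (e.g. superseded by compaction), so
  -- no checkpoint has ever referenced it and it can be unlinked right away.
  local function can_collect_without_checkpoint(run, last_checkpoint_lsn)
      return run.dropped and run.created_lsn > last_checkpoint_lsn
  end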

@kostja kostja self-assigned this Mar 7, 2018
@kostja kostja added the bug label Mar 7, 2018
@kostja kostja added this to the 1.10.1 milestone Mar 7, 2018
@kostja kostja added the vinyl label Mar 7, 2018
@kostja kostja changed the title Bump compaction priority of an index/range if read amplification is high Bump compaction priority of an index/range if read or space amplification is high May 17, 2018
@kostja kostja changed the title Bump compaction priority of an index/range if read or space amplification is high Bump compaction priority of an index/range if read/space amplification is high May 17, 2018
@kostja kostja modified the milestones: 1.10.2, 1.10.3 May 30, 2018

locker commented Sep 20, 2018

We actually ran into this problem while investigating #3603: running the random insertion test attached to the above-mentioned issue on a space with two indexes resulted in about 15M records in the primary index and more than 250M records in the secondary index after several hours of run time, while the number of unique keys was 10M. That is, space amplification for the secondary index exceeded 25, which is huge even for an LSM-tree based design. We probably need to take into account the number of DELETE statements and force compaction if there are too many of them.

locker added a commit that referenced this issue Oct 20, 2018
This patch adds a new entry to per index statistics reported by
index.stat():

  disk.statement
    inserts
    replaces
    deletes
    upserts

It shows the number of statements of each type stored in run files.
The new statistics are persisted in index files. We will need this
information so that we can force major compaction when there are too
many DELETE statements accumulated in run files.

Needed for #3225
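
For reference, with this patch the new counters would show up in index_object:stat() output roughly like this (the space and index names below are made up):

  -- Hypothetical usage: inspect how many statements of each type are
  -- stored in the run files of a vinyl index.
  local stat = box.space.test.index.pk:stat()
  for _, kind in ipairs({'inserts', 'replaces', 'deletes', 'upserts'}) do
      print(kind, stat.disk.statement[kind])
  end
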
locker added a commit that referenced this issue Oct 20, 2018
Even a perfectly shaped LSM tree can accumulate a huge number of DELETE
statements over time in case indexed fields are frequently updated. This
can significantly increase read and space amplification, especially for
secondary indexes.

One way to deal with it is to propagate read amplification back to the
scheduler so that it can raise compaction priority accordingly. Although
this would probably make sense, it wouldn't be enough, because it
wouldn't deal with space amplification growth in case the workload is
write-mostly.

So this patch simply forces major compaction for a range if the number
of DELETEs plus the statements they are supposed to purge exceeds half
of all statements in the range. This should prevent uncontrollable growth
of the number of DELETEs and confine space and read amplification to
sane values.

Part of #3225
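
A rough Lua illustration of the condition described in this commit message; the per-range counters are hypothetical names, not the actual patch code.

  -- Each DELETE is expected to purge at least one older statement, so the
  -- "dead weight" of a range is estimated as 2 * deletes. Force a major
  -- compaction once it exceeds half of all statements in the range.
  local function needs_forced_major_compaction(range)
      return 2 * range.stmt_deletes > range.stmt_total / 2
  end
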
@kyukhin kyukhin modified the milestones: 1.10.3, 1.10.4 Apr 1, 2019

locker commented May 31, 2019

So, if we didn't have DELETE statements, neither explicit nor implicit (i.e. generated for secondary indexes on tuple update), we wouldn't need to do anything at all, because in that case read, write, and space amplification are all limited by the target LSM tree shape. So this issue is about taking DELETE statements into account while calculating compaction priority.

If we didn't have implicit DELETEs, i.e. if there were no secondary indexes, write and space amplification would still be limited by the LSM tree shape, because the number of unique statements in a primary index can be estimated as the number of tuples at the last (largest) LSM tree level. Read amplification could be greater, though, because there might be a group of deleted statements that is requested frequently, and bloom filters don't filter out such requests. To handle this, we can't simply trigger compaction when the portion of DELETE statements is high - it may not be enough, because the number of deleted statements may be low while the number of requests for them may be high. We should keep track of such garbage requests and trigger compaction when there are too many of them (we probably need to maintain a histogram for that).
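
One possible accounting sketch for such garbage requests, with hypothetical names and an arbitrary threshold; a histogram, as suggested above, would give a better signal than a plain ratio.

  -- Count lookups that hit only deleted statements; once their share in a
  -- range is high, bump its compaction priority.
  local GARBAGE_LOOKUP_SHARE = 0.5  -- illustrative value

  local function on_lookup(range, found_visible_tuple)
      range.lookups = (range.lookups or 0) + 1
      if not found_visible_tuple then
          range.garbage_lookups = (range.garbage_lookups or 0) + 1
      end
  end

  local function reads_need_compaction(range)
      return (range.lookups or 0) > 0 and
          (range.garbage_lookups or 0) / range.lookups > GARBAGE_LOOKUP_SHARE
  end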

The situation gets substantially worse in the presence of secondary indexes, because DELETEs may be deferred in this case. For example, suppose we have a tuple (i,j) where the first field (i) is primary and the second field (j) is secondary. If we update j frequently for the same i, we will get a lot of tuples (i,j), j=1,2,... in the secondary index, and they won't be expunged on compaction until DELETEs are generated for them, i.e. until the primary index is compacted, which may take a while. This means that the number of statements at the last LSM tree level may skyrocket, and so does space amplification. We can limit space amplification by triggering compaction when the portion of DELETE statements is high, but just like in the primary-index-only case (see the above paragraph), this isn't enough to keep read amplification limited. What is worse, we can't trigger compaction when we skip too many DELETEs while processing read requests, because deferred DELETEs are completely ignored by low-level iterators due to the implementation of the read iterator (see the VY_STMT_SKIP_READ flag). We can't trigger compaction when the number of skipped stale statements is high either - we can't know for sure whether DELETEs have been generated for them, so we don't know whether compaction will help at all. I suggest the following solution:

  1. Get rid of the VY_STMT_SKIP_READ flag and don't ignore deferred DELETE statements in secondary indexes. This may prove tricky, because we will need to teach the read iterator to handle out-of-order statements (currently it assumes that newer sources always store newer statements, while with deferred DELETEs this isn't necessarily true), but I think it should be possible. We will probably have to implement a heap to order statements to achieve that (like in the write iterator). This is good to do anyway, because it should speed up secondary index lookups: we won't need to do a primary index lookup for those stale statements for which deferred DELETEs have already been generated.

  2. Account skipped DELETE statements. If there are too many of them, trigger compaction of the corresponding secondary index range. What's "too many"? I guess something like "more than N skipped DELETEs per read statement" at the 99th percentile. N is empirical.

  3. We should also account REPLACE statements skipped because they are stale, i.e. they have been updated in the primary index, but DELETEs haven't been generated for them yet. If there are too many of them (see above for the definition of "too many"), we should trigger compaction of the primary index. Which ranges to compact in the primary index? Those that these keys fall into (i.e. we need to account skipped statements per primary index range). How many runs to compact? Enough to include the LSNs of the skipped statements (this is tricky to account for; we will probably need to estimate it somehow).

Apart from that, we should also trigger compaction when the portion of DELETEs is high, to keep space amplification of secondary indexes low even if there are no read requests. For example, if for some level the number of deferred DELETEs N is lower than the number of REPLACEs M, but greater than M/2, trigger compaction of those runs. Note that we shouldn't compact runs where there are substantially more DELETEs than REPLACEs, as such a compaction can't possibly purge them.
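
A small sketch of that last rule, with N and M as hypothetical per-level counters:

  -- Compact a level's runs only when DELETEs are numerous enough to be
  -- worth purging (N > M/2) but not so numerous that the statements they
  -- target live outside these runs (N < M).
  local function level_needs_delete_compaction(level)
      local n, m = level.deferred_deletes, level.replaces
      return n > m / 2 and n < m
  end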

@kyukhin kyukhin added the performance and feature labels and removed the bug label Jul 22, 2019
@kyukhin kyukhin modified the milestones: 1.10.4, 2.4.0 Jul 22, 2019
@kyukhin kyukhin modified the milestones: 2.4.1, 2.5.1 Jan 23, 2020
@kyukhin kyukhin modified the milestones: 2.5.1, 2.6.1 Apr 10, 2020
@kyukhin kyukhin modified the milestones: 2.6.1, wishlist Oct 23, 2020
locker pushed a commit that referenced this issue Jan 29, 2024
In the process of graceful shutdown it is convenient to first finish
all client (non-system) fibers. Otherwise we would have to be ready for
any subsystem to handle requests from client fibers during or after that
subsystem's shutdown, which would make the code more complex.

We first cancel client fibers and then wait for them to finish. A
fiber may not respond to the cancel and hang, which would make shutdown
hang, but this is the approach we already chose for iproto shutdown.

Note that as a result of this approach the application will panic if
it is shut down during execution of the initialization script (in
particular, if this script is running box.cfg).

There are changes in the application/tests to adapt to client fiber
shutdown:

- make code cancellable (only to pass existing tests; we did not
  investigate all the possible places that should be made such).

- make the console stop echoing to the client before client fiber
  shutdown. Otherwise, since the console server fiber is a client fiber,
  we would send a message that the fiber is cancelled on shutdown, which
  breaks a lot of existing tests. This approach is on par with iproto
  shutdown.

- some tests (7743, replication-luatest/shutdown, replication/anon,
  replication/force_recovery, etc.) test shutdown during execution of the
  init script. Now a panic is expected, so change them accordingly.

- some tests (8530, errinj_vylog) use error injection that blocks client
  fibers from finishing. Those tests don't need graceful shutdown, so
  let's just kill tarantool instead.

- change the test in vinyl/errinj for gh-3225. We don't really need
  to check the case when the vinyl reader is blocked, as it executes small
  tasks (we assume the read syscall will not hang). Also change the test
  for vinyl dump shutdown by slowing the dump down instead of blocking it
  entirely. This is required for the client fibers in the test to finish
  in time.

- other similar changes

Also we can drop the code from replication shutdown that was required to
handle client requests during/after shutdown.

Part of #8423

NO_CHANGELOG=internal
NO_DOC=internal