Bump compaction priority of an index/range if read/space amplification is high #3225

Open
kostja opened this issue Mar 7, 2018 · 2 comments
Comments

kostja commented Mar 7, 2018

Test scenario (Mail.Ru visapi):

  • create an index which is 35GB in size
  • delete all the data in the index with expirationd
  • expirationd keeps using a lot of CPU while iterating over garbage, and major compaction never happens even though the space is empty.

This happens because we never feed the read amplification cost back into the compaction priority of a range. This is an omission, please fix.

How to fix

Measure read amplification and write amplification. Increase compaction priority, or simply force major compaction, if read amplification > write amplification * level_size_ratio.
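
A minimal sketch of this heuristic in Lua, for illustration only: the per-range counters read_amp, write_amp, level_size_ratio and run_count are hypothetical names, not actual vinyl scheduler internals.

  -- Hypothetical per-range counters; not the real vinyl scheduler API.
  local function compaction_priority(range)
      -- If lookups touch more runs than the write-amplification budget
      -- implied by the level size ratio, schedule a major compaction.
      if range.read_amp > range.write_amp * range.level_size_ratio then
          return math.huge
      end
      -- Otherwise fall back to the usual shape-based priority.
      return range.run_count
  end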

Test scenario (cloud.mail.ru):

Run spacer/moonwalker against the entire data set and add a field to each tuple. Watch space amplification double while compaction doesn't happen: it doesn't happen because the tree shape is largely unchanged by rewriting all tuples.

How to fix

Write amplification in LSM trees is ~15, which means we create ~15GB of temporary files for each 1GB of written data. Currently we garbage collect these files only after a checkpoint. If a file is entirely temporary, i.e. not referenced by any checkpoint, we can collect it without waiting for a checkpoint.
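
A hedged sketch of the idea, assuming hypothetical run metadata (created_lsn, dropped) and the LSN of the last checkpoint; the real vinyl log and garbage-collector bookkeeping is more involved.

  -- A run file is "entirely temporary" if it was created after the last
  -- checkpoint and is already unused (e.g. superseded by compaction), so
  -- no checkpoint has ever referenced it and it can be unlinked right away.
  local function can_collect_without_checkpoint(run, last_checkpoint_lsn)
      return run.dropped and run.created_lsn > last_checkpoint_lsn
  end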

@kostja kostja self-assigned this Mar 7, 2018
@kostja kostja added the bug label Mar 7, 2018
@kostja kostja added this to the 1.10.1 milestone Mar 7, 2018
@kostja kostja added the vinyl label Mar 7, 2018
@kostja kostja changed the title Bump compaction priority of an index/range if read amplification is high Bump compaction priority of an index/range if read or space amplification is high May 17, 2018
@kostja kostja changed the title Bump compaction priority of an index/range if read or space amplification is high Bump compaction priority of an index/range if read/space amplification is high May 17, 2018
@kostja kostja modified the milestones: 1.10.2, 1.10.3 May 30, 2018

locker commented Sep 20, 2018

We actually ran into this problem while investigating #3603: running the random insertion test attached to the above-mentioned issue on a space with two indexes resulted in about 15M records in the primary index and more than 250M records in the secondary index after several hours of run time, while the number of unique keys was 10M. That is, space amplification for the secondary index exceeded 25, which is huge even for an LSM-tree based design. We probably need to take into account the number of DELETE statements and force compaction if there are too many of them.

locker added a commit that referenced this issue Oct 20, 2018
This patch adds a new entry to per index statistics reported by
index.stat():

  disk.statement
    inserts
    replaces
    deletes
    upserts

It shows the number of statements of each type stored in run files.
The new statistics are persisted in index files. We will need this
information so that we can force major compaction when there are too
many DELETE statements accumulated in run files.

Needed for #3225
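
For reference, with this patch the new counters would show up in index_object:stat() output roughly like this (the space and index names below are made up):

  -- Hypothetical usage: inspect how many statements of each type are
  -- stored in the run files of a vinyl index.
  local stat = box.space.test.index.pk:stat()
  for _, kind in ipairs({'inserts', 'replaces', 'deletes', 'upserts'}) do
      print(kind, stat.disk.statement[kind])
  end
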
locker added a commit that referenced this issue Oct 20, 2018
Even a perfectly shaped LSM tree can accumulate a huge number of DELETE
statements over time in case indexed fields are frequently updated. This
can significantly increase read and space amplification, especially for
secondary indexes.

One way to deal with it is to propagate read amplification back to the
scheduler so that it can raise compaction priority accordingly. Although
this would probably make sense, it wouldn't be enough, because it
wouldn't deal with space amplification growth in case the workload is
write-mostly.

So this patch simply forces major compaction for a range if the number
of DELETEs plus the statements they are supposed to purge exceeds half
of all statements in the range. This should prevent uncontrollable growth
of the number of DELETEs and confine space and read amplification to
sane values.

Part of #3225
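
A rough Lua illustration of the condition described in this commit message; the per-range counters are hypothetical names, not the actual patch code.

  -- Each DELETE is expected to purge at least one older statement, so the
  -- "dead weight" of a range is estimated as 2 * deletes. Force a major
  -- compaction once it exceeds half of all statements in the range.
  local function needs_forced_major_compaction(range)
      return 2 * range.stmt_deletes > range.stmt_total / 2
  end
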
@kyukhin kyukhin modified the milestones: 1.10.3, 1.10.4 Apr 1, 2019

locker commented May 31, 2019

So, if we didn't have DELETE statements, neither explicit nor implicit (i.e. generated for secondary indexes on tuple update), we wouldn't need to do anything at all, because in that case read, write, and space amplification are all limited by the target LSM tree shape. So this issue is about taking DELETE statements into account while calculating compaction priority.

If we didn't have implicit DELETEs, i.e. if there were no secondary indexes, write and space amplification would still be limited by the LSM tree shape, because the number of unique statements in a primary index can be estimated as the number of tuples at the last (largest) LSM tree level. Read amplification could be greater, though, because there might be a group of deleted statements that is requested frequently, and bloom filters don't filter out such requests. To handle this, we can't simply trigger compaction when the portion of DELETE statements is high - it may not be enough, because the number of deleted statements may be low while the number of requests for them may be high. We should keep track of such garbage requests and trigger compaction when there are too many of them (we probably need to maintain a histogram for that).
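
One possible accounting sketch for such garbage requests, with hypothetical names and an arbitrary threshold; a histogram, as suggested above, would give a better signal than a plain ratio.

  -- Count lookups that hit only deleted statements; once their share in a
  -- range is high, bump its compaction priority.
  local GARBAGE_LOOKUP_SHARE = 0.5  -- illustrative value

  local function on_lookup(range, found_visible_tuple)
      range.lookups = (range.lookups or 0) + 1
      if not found_visible_tuple then
          range.garbage_lookups = (range.garbage_lookups or 0) + 1
      end
  end

  local function reads_need_compaction(range)
      return (range.lookups or 0) > 0 and
          (range.garbage_lookups or 0) / range.lookups > GARBAGE_LOOKUP_SHARE
  end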

The situation gets substantially worse in the presence of secondary indexes, because DELETEs may be deferred in this case. For example, suppose we have a tuple (i,j) where the first field (i) is primary and the second field (j) is secondary. If we update j frequently for the same i, we will get a lot of tuples (i,j), j=1,2,... in the secondary index, and they won't be expunged on compaction until DELETEs are generated for them, i.e. until the primary index is compacted, which may take a while. This means that the number of statements at the last LSM tree level may skyrocket, and so does space amplification. We can limit space amplification by triggering compaction when the portion of DELETE statements is high, but just like in the primary-index-only case (see the above paragraph), this isn't enough to keep read amplification limited. What is worse, we can't trigger compaction when we skip too many DELETEs while processing read requests, because deferred DELETEs are completely ignored by low-level iterators due to the implementation of the read iterator (see the VY_STMT_SKIP_READ flag). We can't trigger compaction when the number of skipped stale statements is high either - we can't know for sure whether DELETEs have been generated for them, so we don't know whether compaction will help at all. I suggest the following solution:

  1. Get rid of the VY_STMT_SKIP_READ flag and don't ignore deferred DELETE statements in secondary indexes. This may prove tricky, because we will need to teach the read iterator to handle out-of-order statements (currently it assumes that newer sources always store newer statements, while with deferred DELETEs this isn't necessarily true), but I think it should be possible. We will probably have to implement a heap to order statements to achieve that (like in the write iterator). This is good to do anyway, because it should speed up secondary index lookups: we won't need to do a primary index lookup for those stale statements for which deferred DELETEs have already been generated.

  2. Account skipped DELETE statements. If there are too many of them, trigger compaction of the corresponding secondary index range. What's "too many"? I guess something like "more than N skipped DELETEs per read statement" at the 99th percentile. N is empirical.

  3. We should also account REPLACE statements skipped because they are stale, i.e. they have been updated in the primary index, but DELETEs haven't been generated for them yet. If there are too many of them (see above for the definition of "too many"), we should trigger compaction of the primary index. Which ranges to compact in the primary index? Those that these keys fall into (i.e. we need to account skipped statements per primary index range). How many runs to compact? Enough to include the LSNs of the skipped statements (this is tricky to account for; we will probably need to estimate it somehow).

Apart from that, we should also trigger compaction when the portion of DELETEs is high, to keep space amplification of secondary indexes low even if there are no read requests. For example, if for some level the number of deferred DELETEs N is lower than the number of REPLACEs M, but greater than M/2, trigger compaction of those runs. Note that we shouldn't compact runs where there are substantially more DELETEs than REPLACEs, as such a compaction can't possibly purge them.
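
A small sketch of that last rule, with N and M as hypothetical per-level counters:

  -- Compact a level's runs only when DELETEs are numerous enough to be
  -- worth purging (N > M/2) but not so numerous that the statements they
  -- target live outside these runs (N < M).
  local function level_needs_delete_compaction(level)
      local n, m = level.deferred_deletes, level.replaces
      return n > m / 2 and n < m
  end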

@kyukhin kyukhin added the performance and feature labels and removed the bug label Jul 22, 2019
@kyukhin kyukhin modified the milestones: 1.10.4, 2.4.0 Jul 22, 2019
@kyukhin kyukhin modified the milestones: 2.4.1, 2.5.1 Jan 23, 2020
@kyukhin kyukhin modified the milestones: 2.5.1, 2.6.1 Apr 10, 2020
@kyukhin kyukhin modified the milestones: 2.6.1, wishlist Oct 23, 2020
locker pushed a commit that referenced this issue Jan 29, 2024
In the process of graceful shutdown it is convenient to first finish
all client (non-system) fibers. Otherwise we would have to be ready for
any subsystem to handle requests from client fibers during or after that
subsystem's shutdown, which would make the code more complex.

We first cancel client fibers and then wait for them to finish. A
fiber may not respond to the cancel and hang, which would make shutdown
hang, but this is the approach we already chose for iproto shutdown.

Note that as a result of this approach the application will panic if
it is shut down during execution of the initialization script (in
particular, if this script is running box.cfg).

There are changes in the application/tests to adapt to client fiber
shutdown:

- make code cancellable (only to pass existing tests; we did not
  investigate all the possible places that should be made such).

- make the console stop echoing to the client before client fiber
  shutdown. Otherwise, since the console server fiber is a client fiber,
  we would send a message that the fiber is cancelled on shutdown, which
  breaks a lot of existing tests. This approach is on par with iproto
  shutdown.

- some tests (7743, replication-luatest/shutdown, replication/anon,
  replication/force_recovery, etc.) test shutdown during execution of the
  init script. Now a panic is expected, so change them accordingly.

- some tests (8530, errinj_vylog) use error injection that blocks client
  fibers from finishing. Those tests don't need graceful shutdown, so
  let's just kill tarantool instead.

- change the test in vinyl/errinj for gh-3225. We don't really need
  to check the case when the vinyl reader is blocked, as it executes small
  tasks (we assume the read syscall will not hang). Also change the test
  for vinyl dump shutdown by slowing the dump down instead of blocking it
  entirely. This is required for the client fibers in the test to finish
  in time.

- other similar changes

Also we can drop the code from replication shutdown that was required to
handle client requests during/after shutdown.

Part of #8423

NO_CHANGELOG=internal
NO_DOC=internal