Bump compaction priority of an index/range if read/space amplification is high #3225
We actually ran into this problem while investigating #3603: running the random insertion test attached to that issue on a space with two indexes resulted in about 15M records in the primary index and more than 250M records in the secondary index after several hours of run time, while the number of unique keys was 10M. That is, space amplification for the secondary index exceeded 25, which is huge even for an LSM-tree based design. We probably need to take the number of DELETE statements into account and force compaction if there are too many of them.
This patch adds a new entry to per index statistics reported by index.stat():

```
disk.statement
    inserts
    replaces
    deletes
    upserts
```

It shows the number of statements of each type stored in run files. The new statistics are persisted in index files. We will need this information so that we can force major compaction when there are too many DELETE statements accumulated in run files. Needed for #3225
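For illustration, a hypothetical console session against a vinyl space (the space and index names `test`/`pk` and the counter values are made up; the counter names come from the commit message, though the exact output shape may differ across versions):

```
tarantool> box.space.test.index.pk:stat().disk.statement
---
- inserts: 1000000
  replaces: 250000
  deletes: 400000
  upserts: 0
...
```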
Even a perfectly shaped LSM tree can accumulate a huge number of DELETE statements over time when indexed fields are frequently updated. This can significantly increase read and space amplification, especially for secondary indexes. One way to deal with this is to propagate read amplification back to the scheduler so that it can raise compaction priority accordingly. Although that would probably make sense, it wouldn't be enough, because it wouldn't address space amplification growth when the workload is write-mostly. So this patch simply forces major compaction of a range if the number of DELETEs plus the statements they are supposed to purge exceeds half of all statements in the range. This should prevent uncontrollable growth of the number of DELETEs and confine space and read amplification to sane values. Part of #3225
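A back-of-the-envelope restatement of that trigger as a sketch (hypothetical function and counter names; the real scheduler works on the per-range run statistics persisted by the previous patch):

```lua
-- Sketch of the forced major compaction condition described above.
-- Assumes each DELETE is expected to purge exactly one older statement,
-- which is the best case; this is not the actual engine code.
local function needs_major_compaction(n_deletes, n_statements)
    local purged = n_deletes  -- statements the DELETEs should cancel
    return n_deletes + purged > n_statements / 2
end
```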
So, if we didn't have DELETE statements, neither explicit nor implicit (i.e. generated for secondary indexes on tuple update), we wouldn't need to do anything at all: in that case read, write, and space amplification are all limited by the target LSM tree shape. This issue is therefore about taking DELETE statements into account when calculating compaction priority.

If we didn't have implicit DELETEs, i.e. if there were no secondary indexes, write and space amplification would still be limited by the LSM tree shape, because the number of unique statements in a primary index can be estimated as the number of tuples at the last (largest) LSM tree level. Read amplification could be greater, though: a group of deleted statements might be requested frequently, and bloom filters don't reflect such requests. To handle this, we can't simply trigger compaction when the portion of DELETE statements is high; that may not be enough, because the number of deleted statements may be low while the number of requests for them is high. We should keep track of such garbage requests and trigger compaction when there are too many of them (we probably need to maintain a histogram for that).

The situation gets substantially worse in the presence of secondary indexes, because DELETEs may be deferred in this case. For example, suppose we have a tuple (i,j) where the first field (i) is primary and the second field (j) is secondary. If we frequently update j for the same i, we get a lot of tuples (i,j), j=1,2,... in the secondary index, and they won't be expunged on compaction until DELETEs are generated for them, i.e. until the primary index is compacted, which may take a while (see the sketch below). This means the number of statements at the last LSM tree level may rocket sky high, and so does space amplification. We can limit space amplification by triggering compaction when the portion of DELETE statements is high, but just as in the primary-index-only case above, that isn't enough to keep read amplification limited. What is worse, we can't trigger compaction when we skip too many DELETEs while processing read requests, because deferred DELETEs are completely ignored by low-level iterators due to the implementation of the read iterator (see the VY_STMT_SKIP_READ flag). Nor can we trigger compaction when the number of skipped stale statements is high: we can't know for sure whether DELETEs have been generated for them, so we don't know if compaction will help at all.

I suggest the following solution:
Apart from that, we should also trigger compaction when the portion of DELETEs is high, to keep space amplification of secondary indexes low even if there are no read requests. For example, if for some level the number of deferred DELETEs N is lower than the number of REPLACEs M but greater than M/2, trigger compaction of those runs. Note that we shouldn't compact DELETEs if there are substantially more of them than REPLACEs, as this can't possibly purge them.
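Here is a minimal sketch of the (i, j) scenario above, with a toy restatement of the proposed per-level check (space and index names are made up; the deferred-DELETE behavior is as described in this issue, not a guaranteed API):

```lua
box.cfg{}  -- assumes a fresh tarantool instance

-- A vinyl space with a secondary index on the frequently updated field j.
local s = box.schema.space.create('kv', {engine = 'vinyl'})
s:create_index('pk', {parts = {1, 'unsigned'}})
s:create_index('sk', {parts = {2, 'unsigned'}, unique = false})

s:insert{1, 0}
-- Every update of j writes a new REPLACE into 'sk'; the DELETE for the
-- old value is deferred until the primary index is compacted, so 'sk'
-- keeps accumulating {1, 0}, {1, 1}, ... {1, 10000} in the meantime.
for j = 1, 10000 do
    s:update(1, {{'=', 2, j}})
end

-- Toy restatement of the proposed per-level trigger: compact when
-- M/2 < N < M, for N deferred DELETEs and M REPLACEs in the level.
local function should_compact_level(n, m)
    return n > m / 2 and n < m
end
```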
In the process of graceful shutdown it is convenient to first finish all client (non-system) fibers. Otherwise every subsystem has to be ready to handle a request from a client fiber during or after its own shutdown, which would make the code more complex. We first cancel client fibers and then wait for them to finish. A fiber may not respond to cancellation and hang, which would make shutdown hang too, but this is the approach we already chose for iproto shutdown. Note that as a result of this approach the application will panic if it is shut down during execution of the initialization script (in particular, if that script is doing box.cfg).

There are changes in the application/tests to adapt to client fiber shutdown:
- make code cancellable (only enough to pass existing tests; we did not investigate all the places that should be made such).
- make the console stop sending echo to the client before client fiber shutdown. Otherwise, since the console server fiber is a client fiber, we would send a message that the fiber is cancelled on shutdown, which breaks a lot of existing tests. This approach is on par with iproto shutdown.
- some tests (7743, replication-luatest/shutdown, replication/anon, replication/force_recovery etc.) test shutdown during execution of the init script. Now a panic is expected, so change them accordingly.
- some tests (8530, errinj_vylog) use error injection that blocks client fibers from finishing. Those tests don't need graceful shutdown, so just kill tarantool instead.
- change the test in vinyl/errinj for gh-3225. We don't really need to check the case when the vinyl reader is blocked, as it executes small tasks (we assume the read syscall will not hang). Also change the test for vinyl dump shutdown to slow the dump down instead of blocking it entirely; this is required for the client fibers in the test to finish in time.
- other similar changes.

Also, we can drop the code from replication shutdown that was required to handle client requests during/after shutdown.

Part of #8423

NO_CHANGELOG=internal
NO_DOC=internal
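As a rough illustration of the cancel-then-wait pattern described above, here is a minimal sketch using Tarantool's joinable fibers (the actual implementation lives in C and tracks client fibers differently; the registry and function names here are made up):

```lua
local fiber = require('fiber')

-- Hypothetical registry of client (non-system) fibers.
local client_fibers = {}

local function spawn_client(fn, ...)
    local f = fiber.new(fn, ...)
    f:set_joinable(true)
    table.insert(client_fibers, f)
    return f
end

-- Cancel-then-wait shutdown: first ask every client fiber to stop,
-- then join them all. A fiber that ignores cancellation hangs the
-- shutdown, which is the trade-off mentioned above.
local function shutdown_client_fibers()
    for _, f in ipairs(client_fibers) do
        f:cancel()
    end
    for _, f in ipairs(client_fibers) do
        f:join()
    end
end
```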
Test scenario (Mail.Ru visapi):
This happens because we never feed the read amplification cost back into the compaction priority of a range. This is an omission, please fix.
How to fix
Measure read amplification and write amplification. Increase compaction priority, or simply force major compaction, if read amplification > write amplification * level_size_ratio.
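A literal restatement of that condition as a sketch (hypothetical names; how the two amplification figures would be measured is left open by the issue):

```lua
-- Force major compaction when reads pay more than the target tree
-- shape implies: read_amp exceeds write_amp scaled by the level
-- size ratio of the LSM tree.
local function force_major_compaction(read_amp, write_amp, level_size_ratio)
    return read_amp > write_amp * level_size_ratio
end
```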
Test scenario (cloud.mail.ru)
Run spacer/moonwalker against the entire data set, adding a field to each tuple. Watch space amplification double while compaction doesn't happen: it is not triggered because updating every tuple barely changes the tree shape.
How to fix
Write amplification in LSM trees is ~15, which means we create ~15GB of temporary files for each 1GB of written data. Currently we garbage collect these files only after a checkpoint. If a file is entirely temporary, i.e. not referenced by any checkpoint, we can collect it without waiting for a checkpoint.