-
Notifications
You must be signed in to change notification settings - Fork 23.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Multi Part AOF mechanism to avoid AOFRW overheads. #9788
Implement Multi Part AOF mechanism to avoid AOFRW overheads. #9788
Conversation
@chenyang8094 Just out of curiosity, does this create a foundation to extend #8015 to AOF-only and not only RDB-only setups? Thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenyang8094 Thank you.
I started reviewing the PR, and also edited the PR top comment to make it clearer (please check if it looks right).
So far i reviewed just parts of it (namely all the files excluding aof.c and the tests).
I.d like to post my comments for discussion before i proceed with the other parts, to see if we agree on things, or maybe i have a misunderstanding.
Sure, you can see some followup PR plans mentioned by oranagra herer. |
I've just created a few new issues to track these ideas (don't like them sitting around in a comment in a closed PR) The other ones might not make it for this release, but we can at least start discussion and mark them for next release: |
@oranagra About upgrades from old redis versions, I think we can't directly rename the old AOF to the new name, because we still can't solve the atomic problem of rename and update manifest. We can write the old AOF name directly into the manifest, so that we can start loading normally. Even if the process crashes after writing the manifest, that is no problem, because we can find the AOF from the manifest. Once AOFRW is executed once, all files will have new file names. (current code is implemented according to this idea) |
ok, sounds good to me. so the AOF file (possibly with preamble content) will have it's old name, and will be listed in the manifest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done reviewing aof.c, didn't review the tests yet.
Change aofRewriteLimited and Update TCL test
🎉 more than 400 comments 2500 LOC, and some 3 months (counting form #9539) |
Thank you, now I am going to deal with the next PR. |
@chenyang8094 can you please look into this failure:
another one: |
i also saw it in my daily. so maybe change it to 1KB, or do a for loop and insert some keys (maybe at least 20000 keys) |
maybe.. maybe instead of using |
it is easy to reproduce if we delete all the like in this commit (7d4f50f), if we change the 20000 to be smaller, it will fail because not able to touch 1MB. (i can trigger a CI in FreeBSD and see the result, but i think this is the right fix, the but the time it should be similar in FreeBSD, because we need to generate that many keys all the time (successful use case too)) ci: https://github.com/enjoy-binbin/redis/runs/4912825635?check_suite_focus=true#step:4:3028
btw, i think it is not enought, the |
I think we can change |
In redis#9788, now we stores all persistent append-only files in a dedicated directory. The name of the directory is determined by the appenddirname configuration parameter in redis.conf Update create-cluster clean to clean this default directory. Fixes redis#10222
) Implement Multi-Part AOF mechanism to avoid overheads during AOFRW. Introducing a folder with multiple AOF files tracked by a manifest file. The main issues with the the original AOFRW mechanism are: * buffering of commands that are processed during rewrite (consuming a lot of RAM) * freezes of the main process when the AOFRW completes to drain the remaining part of the buffer and fsync it. * double disk IO for the data that arrives during AOFRW (had to be written to both the old and new AOF files) The main modifications of this PR: 1. Remove the AOF rewrite buffer and related code. 2. Divide the AOF into multiple files, they are classified as two types, one is the the `BASE` type, it represents the full amount of data (Maybe AOF or RDB format) after each AOFRW, there is only one `BASE` file at most. The second is `INCR` type, may have more than one. They represent the incremental commands since the last AOFRW. 3. Use a AOF manifest file to record and manage these AOF files mentioned above. 4. The original configuration of `appendfilename` will be the base part of the new file name, for example: `appendonly.aof.1.base.rdb` and `appendonly.aof.2.incr.aof` 5. Add manifest-related TCL tests, and modified some existing tests that depend on the `appendfilename` 6. Remove the `aof_rewrite_buffer_length` field in info. 7. Add `aof-disable-auto-gc` configuration. By default we're automatically deleting HISTORY type AOFs. It also gives users the opportunity to preserve the history AOFs. just for testing use now. 8. Add AOFRW limiting measure. When the AOFRW failures reaches the threshold (3 times now), we will delay the execution of the next AOFRW by 1 minute. If the next AOFRW also fails, it will be delayed by 2 minutes. The next is 4, 8, 16, the maximum delay is 60 minutes (1 hour). During the limit period, we can still use the 'bgrewriteaof' command to execute AOFRW immediately. 9. Support upgrade (load) data from old version redis. 10. Add `appenddirname` configuration, as the directory name of the append only files. All AOF files and manifest file will be placed in this directory. 11. Only the last AOF file (BASE or INCR) can be truncated. Otherwise redis will exit even if `aof-load-truncated` is enabled. Co-authored-by: Oran Agra <oran@redislabs.com>
In #9788, now we stores all persistent append-only files in a dedicated directory. The name of the directory is determined by the appenddirname configuration parameter in redis.conf. Now each node have a separate folder. Update create-cluster clean to clean this default directory.
@@ -16,7 +16,8 @@ doc-tools | |||
release | |||
misc/* | |||
src/release.h | |||
appendonly.aof | |||
appendonly.aof.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be appendonly.aof*
instead of appendonly.aof.*
now, appendonly.aof
will not be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you're right, but the current implementation doesn't create a file named appendonly.aof
unless you manually copy an old-style AOF for upgrade testing. So I think we just need to ignore the file we're going to create. @oranagra WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's a good idea to ignore both.
i.e. both the ones we create, and the ones we created in the past, that are still left in the folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only developers pay attention to .gitignore
, and when I updated the unstable brance
code, appendonly.aof
caught my attention, so I started this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oranagra @yangbodong22011 Sounds reasonable, I'll make a PR to add it, thanks.
…YNC_ALWAYS This PR fix two bugs: 1. About the MP-AOF one, that code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with multipart-aof, the fork done handler doesn't really affect the incremental file being maintained by the parent process, there's no reason to re-issue SELECT, and i suppose there's no reason to update any of the fsync metrics. This should have been deleted with MP-AOF (introduced in redis#9788, 7.0). The aof_fsync_offset it updates will cause us to miss an fsync in flushAppendOnlyFile, if we stop write commands in AOF_FSYNC_EVERYSEC 2. About the AOF_FSYNC_ALWAYS one, it is follow redis#6053 (the details is in redis#5985), we also check AOF_FSYNC_ALWAYS to prevent users from switching from everysec to always. We detected these problems with WAITAOF test (redis 7.2), introduced in redis#11713.
…SYNC_ALWAYS (#11973) This PR fix several unrelated bugs that were discovered by the same set of tests (WAITAOF tests in #11713), could make the `WAITAOF` test hang. The change in `backgroundRewriteDoneHandler` is about MP-AOF. That leftover / old code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with MP-AOF, the fork done handler doesn't really affect the incremental file being maintained by the parent process, there's no reason to re-issue `SELECT`, and no reason to update any of the fsync variables in that flow. This should have been deleted with MP-AOF (introduced in #9788, 7.0). The damage is that the update to `aof_fsync_offset` will cause us to miss an fsync in `flushAppendOnlyFile`, that happens if we stop write commands in `AOF_FSYNC_EVERYSEC` while an AOFRW is in progress. This caused a new `WAITAOF` test to sometime hang forever. Also because of MP-AOF, we needed to change `aof_fsync_offset` to `aof_last_incr_fsync_offset` and match it to `aof_last_incr_size` in `flushAppendOnlyFile`. This is because in the past we compared `aof_fsync_offset` and `aof_current_size`, but with MP-AOF it could be the total AOF file will be smaller after AOFRW, and the (already existing) incr file still has data that needs to be fsynced. The change in `flushAppendOnlyFile`, about the `AOF_FSYNC_ALWAYS`, it is follow #6053 (the details is in #5985), we also check `AOF_FSYNC_ALWAYS` to handle a case where appendfsync is changed from everysec to always while there is data that's written but not yet fsynced.
…SYNC_ALWAYS (redis#11973) This PR fix several unrelated bugs that were discovered by the same set of tests (WAITAOF tests in redis#11713), could make the `WAITAOF` test hang. The change in `backgroundRewriteDoneHandler` is about MP-AOF. That leftover / old code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with MP-AOF, the fork done handler doesn't really affect the incremental file being maintained by the parent process, there's no reason to re-issue `SELECT`, and no reason to update any of the fsync variables in that flow. This should have been deleted with MP-AOF (introduced in redis#9788, 7.0). The damage is that the update to `aof_fsync_offset` will cause us to miss an fsync in `flushAppendOnlyFile`, that happens if we stop write commands in `AOF_FSYNC_EVERYSEC` while an AOFRW is in progress. This caused a new `WAITAOF` test to sometime hang forever. Also because of MP-AOF, we needed to change `aof_fsync_offset` to `aof_last_incr_fsync_offset` and match it to `aof_last_incr_size` in `flushAppendOnlyFile`. This is because in the past we compared `aof_fsync_offset` and `aof_current_size`, but with MP-AOF it could be the total AOF file will be smaller after AOFRW, and the (already existing) incr file still has data that needs to be fsynced. The change in `flushAppendOnlyFile`, about the `AOF_FSYNC_ALWAYS`, it is follow redis#6053 (the details is in redis#5985), we also check `AOF_FSYNC_ALWAYS` to handle a case where appendfsync is changed from everysec to always while there is data that's written but not yet fsynced. (cherry picked from commit cb17178)
…SYNC_ALWAYS (#11973) This PR fix several unrelated bugs that were discovered by the same set of tests (WAITAOF tests in #11713), could make the `WAITAOF` test hang. The change in `backgroundRewriteDoneHandler` is about MP-AOF. That leftover / old code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with MP-AOF, the fork done handler doesn't really affect the incremental file being maintained by the parent process, there's no reason to re-issue `SELECT`, and no reason to update any of the fsync variables in that flow. This should have been deleted with MP-AOF (introduced in #9788, 7.0). The damage is that the update to `aof_fsync_offset` will cause us to miss an fsync in `flushAppendOnlyFile`, that happens if we stop write commands in `AOF_FSYNC_EVERYSEC` while an AOFRW is in progress. This caused a new `WAITAOF` test to sometime hang forever. Also because of MP-AOF, we needed to change `aof_fsync_offset` to `aof_last_incr_fsync_offset` and match it to `aof_last_incr_size` in `flushAppendOnlyFile`. This is because in the past we compared `aof_fsync_offset` and `aof_current_size`, but with MP-AOF it could be the total AOF file will be smaller after AOFRW, and the (already existing) incr file still has data that needs to be fsynced. The change in `flushAppendOnlyFile`, about the `AOF_FSYNC_ALWAYS`, it is follow #6053 (the details is in #5985), we also check `AOF_FSYNC_ALWAYS` to handle a case where appendfsync is changed from everysec to always while there is data that's written but not yet fsynced. (cherry picked from commit cb17178)
Implement Multi-Part AOF mechanism to avoid overheads during AOFRW.
Introducing a folder with multiple AOF files tracked by a manifest file.
This PR implements the Meta-File Based Multi-Part AOF mechanism described in 9539.
The main issues with the the original AOFRW mechanism are:
The main modifications of this PR:
BASE
type, it represents the full amount of data (Maybe AOF or RDB format) after each AOFRW, there is only oneBASE
file at most. The second isINCR
type, may have more than one. They represent the incremental commands since the last AOFRW.appendfilename
will be the base part of the new file name, for example:appendonly.aof.1.base.rdb
andappendonly.aof.2.incr.aof
appendfilename
aof_rewrite_buffer_length
field in info.aof-disable-auto-gc
configuration. By default we're automatically deleting HISTORY type AOFs. It also gives users the opportunity to preserve the history AOFs. just for testing use now.of the next AOFRW by 1 minute. If the next AOFRW also fails, it will be delayed by 2 minutes. The next is 4, 8, 16, the maximum delay is 60 minutes (1 hour). During the limit period, we can still use the 'bgrewriteaof' command to execute AOFRW immediately.
appenddirname
configuration, as the directory name of the append only files. All AOF files and manifest file will be placed in this directory.aof-load-truncated
is enabled.To Do: