Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A new AOF persistence mechanism #9539

Closed
wants to merge 2 commits into from
Closed

A new AOF persistence mechanism #9539

wants to merge 2 commits into from

Conversation

chenyang8094
Copy link
Collaborator

@chenyang8094 chenyang8094 commented Sep 23, 2021

Hi, I implemented a new AOF persistence mechanism (currently it is still in the draft stage). Its main purpose is to remove the AOF rewrite buffer, which can save memory overhead during rewrite and shorten the time consumption of rewrite.

I call it the ping-pong model for the time being. The reason why I call it the ping-pong model is because it is similar to the ping-pong buffer. When one buffer is full, I can continue to write another buffer, and then exchange the two buffer without interrupting the entire processing flow.

I divide AOF into four types (BASE, PING, PONG, TEMP). The AOF generated after rewrite is called BASE type AOF (configured by appendfilename configuration item), which is similar to the rdb file and represents a certain one Snapshot of redis data at the moment. That is, the AOF of the BASE type will not write incremental commands.

Next is the PING type AOF (configured by the configuration item aof-ping-filename configuration item), which is basically the same as the current AOF concept, mainly used to store the incremental commands written, when the AOF persistence function is turned on at the time, the executed commands will be written into this type AOF one by one.

When bgrewriteaof is triggered, a new AOF will be generated at this time, which is a PONG type AOF, which will continue to store the user's incremental commands. At the same time, the original PING type AOF will no longer write any data.

When the child process completes the rewrite task (temp-rewriteaof-bg.aof is generated), in the backgroundRewriteDoneHandler, I will rename the temp-rewriteaof-bg.aof file to a new BASE type AOF, and at the same time, I will rename the PONG type AOF to PING type. At this time, BASE AOF and PING AOF together constitute the current full amount of data. It can be seen that in the entire rewrite process, no rewrite buffer is used, and no data is exchanged between the parent process and the child process.

image

The above is the success of rewrite. If the rewrite fails, we will eventually have three AOF files at the same time, namely BASE, PING and PONG, which together constitute the current full amount of data. This means that these AOF files need to be loaded in turn when redis restarts. Even worse, if bgrewrite occurs again at this time, we can no longer generate PONG2, PONG3 and other files without limitation, otherwise we will not know which files we need and their order.

Therefore, for the small probability event of rewrite failure, we will generate a new AOF file, which we temporarily call temp.aof. The next incremental command will be "simultaneously" double written to PONG type AOF and temp.aof. If this rewrite is successful, rename temp.aof to PING type AOF, and both the original PING and PONG type AOFs can be deleted. If this rewrite still fails, just delete temp.aof, the original BASE/PING/PONG still constitute the complete data.

image

Currently, this solution seems to have many problems (although the tests have already been run). For example, how to ensure the atomicity of multiple file renames. Since we did not record any meta informations for AOF files, we must use double writing and multiple renames to limit the number of AOF files. In fact, in our internal practice, we designed a meta file for the AOF file(s) to record the information (including the file name and sequence, etc.), so that we do not need to rename the newly generated AOF file. But this scheme is obviously inappropriate in the community, because once relying on the meta file, it means that it is no longer compatible with the previous redis version.

So, this is my current thinking, I don’t know if I have expressed it clearly. I think this idea is okay, but we still have some problems to solve (such as the atomic problem of multi-file modification, etc.). @redis/core-team Do you have any better suggestions?

TODO:

  • Remove rewrite buf overhead,use ping-pong write models
  • Remove the pipe and data exchange between the child process and the parent
  • The atomicity of multi-file modification
  • More accurate size statistics for multiple AOFs
  • Separately add some test cases for multi aof, such as double writing
  • Add or modify the corresponding code comment

@chenyang8094 chenyang8094 marked this pull request as draft September 23, 2021 01:09
@oranagra oranagra added this to Backlog in 7.0 via automation Sep 23, 2021
@oranagra oranagra moved this from Backlog to In progress in 7.0 Sep 23, 2021
@oranagra
Copy link
Member

@chenyang8094 did you happen to look at #6584 (i haven't), i wonder if maybe they had a solution to the rename problem or something else we can learn from.

@chenyang8094
Copy link
Collaborator Author

chenyang8094 commented Oct 11, 2021

@chenyang8094 did you happen to look at #6584 (i haven't), i wonder if maybe they had a solution to the rename problem or something else we can learn from.

Sorry to have time to reply recently. I probable read the LESS paper and the implementation of #6584. I found that these two solutions have a lot in common. For example, the RDB used in LESS is equivalent to the BASE AOF in my PR, and the AOF in LESS is equivalent to PING/PONG AOF. Among them, I carefully looked at the crash and recovery related parts in the LESS paper. In LESS, it faced the same problem as I encountered. LESS needs to rename the RDB file first, and then rename the AOF file. How to ensure these two renames As for the atomicity of operation, I did not find it in the LESS paper.

I think the paper does not give a method to recover the crash between rename rdb and raname aof. And the description of how to deal with temp aof when rewrite is retried after last rewrite fails.

image

image

@oranagra
Copy link
Member

@chenyang8094 from reading this text (didn't look at the code) i think they did intend to handle this case.
they said they first rename the temp AOF (rename "pong" into "ping") and then the temp RDB (to become the main part 1).
and if a crash would happen after the first rename and before the second, then at recovery they'll notice the temp RDB and use it.
i.e. the rename is not atomic, but the recovery process is able to detect that the first part completed and the second part didn't.

Therefore, LESS conducts a recovery using the Temporary RDB...

So if on startup we find a temp "base" (RDB) file but no "pong" file, we can conclude that "pong" was renamed, and we can rename the temp rdb to "base" and then load these two.

@hshs1103
Copy link

Hello. I was trying to check the progress of a previously posted pull request (#6584 ), and saw that my pull request was mentioned and left a post.

In the case of the paper read by @chenyang8094 , the intention of the LESS module I designed is not explained in detail (due to the limitation of the size of the conference paper). Also, since there are no separate comments in the code, I think it will be difficult to understand.

If you have any questions about LESS, please leave a comment.

*I hope that I will be able to help you with the module you are developing now.

@oranagra oranagra moved this from In progress to In Review in 7.0 Nov 2, 2021
@chenyang8094
Copy link
Collaborator Author

@chenyang8094 from reading this text (didn't look at the code) i think they did intend to handle this case. they said they first rename the temp AOF (rename "pong" into "ping") and then the temp RDB (to become the main part 1). and if a crash would happen after the first rename and before the second, then at recovery they'll notice the temp RDB and use it. i.e. the rename is not atomic, but the recovery process is able to detect that the first part completed and the second part didn't.

Therefore, LESS conducts a recovery using the Temporary RDB...

So if on startup we find a temp "base" (RDB) file but no "pong" file, we can conclude that "pong" was renamed, and we can rename the temp rdb to "base" and then load these two.

Yes, LESS does rely on the order of file rename to solve the problem of rename atomicity. According to this idea, I sorted out a possible solution, but this will bring a lot of complexity when loading data (such as judging Which files are in the current directory, and then decide which files to load), I wonder whether this complexity (including the dependency between multiple files) will cause a lot of difficulties for users to understand.
Or is it possible for us to add a meta file to record the information of the aof files? In this way, there is no need to rename PONG to PING frequently, we can always use aof.incr1, aof.incr2, aof.incr3... etc.

image

image

@oranagra
Copy link
Member

oranagra commented Nov 3, 2021

@chenyang8094 it looks complicated when you paint it like this, but i think that in fact it's just:
First load one of the base files:

  • [1] if temp-bg.aof exists, load it!
  • [2] else if appendonly.aof exists, load it!

Then if one of the base files was loaded, proceed to load the other parts:

  • [3] if appendonly.ping exists, load it.
  • [4] if appendonly.pong exists, load it!

we can add an additional action in [2] to set a recovery flag, and add a condition to [4] not to run it if the recovery flag is set, but it seems like this is not needed (since the rename of pong to ping delete it)

I haven't reviewed the code yet (will do so soon), but i assume that when you say "rewrite success" that's another rename operation that renames the temp-rewriteaof-bg-<pid>.aof to temp-bg.aof. if not maybe i'm missing something and my plan above is wrong.

I understand that a meta file can maybe make this a bit clearer, but it's also one more thing to take core of, so if the 1-4 plan above is right, i don't think it's that complicated.

as you pointed out, the other thing to worry about is for users to know which files to copy / keep.
all of this is indeed much more complicated than how it used to be till now (just one real file, and a bunch of temp files to ignore).
I must say i'm not a fan of all the fancy naming (ping / pong / LESS), and if the files where just numbered (aof1, aof2, aof3) maybe it can be more straight forward to follow. need to think of that a bit more.

another thing that feels odd to me is that in all these examples the appendonly.aof is actually a plain RDB file, so that's a bit misleading / awkward. maybe we can resolve this with a different naming scheme.

as a side note, there have been some discussions in other issues in the repo lately about complications that can occur if the user configured his server to persist to both AOF and RDB files (not knowing what to load and where to take the last repl_offset from), and it was mentioned that once we implement this PR, there's no reason to ever do that anymore (since the AOFRW overhead vanishes). so considering that, it could be a nice idea to call it appendonly-1.rdb (followed by appendonly-2.aof, and appendonly-3.aof)

P.s. there's another state that's not mentioned here, which is a server that starts up with an initial configuration to write to an AOF that didn't exist in the past. in this case, there's no RDB portion, and i suppose that in the current code of the PR, that's the only case where there's only appendonly.aof without even a ping, and that's the only case where the appendonly.aof is actually an AOF (no preamble RDB in it).
This may not be a real problem for the purpose of all the discussion above, but it does pose another problem for modules.
Some modules choose to store some metadata / configuration in the RDB aux fields, and in all other scenarios, when they recover from an AOF, they have that data, but in this scenario they don't.
For this reason, i think it may be a good idea that when the server starts up with an AOF configuration and no AOF file, it'll create an empty AOF file (part one), which is a preamble RDB with no keys.
In this case, if follow my terminology, we get appendonly-1.rdb that's always an rdb and is always present.

so bottom line, i'd suggest this scheme:
printf("%s-%d%s.%s", server.aof_filename, sequence_number, state, file_format)

  • server.aof_filename - this is the base file name the user provided in the config file
  • sequence_number - 1,2,3 (i.e. base, ping, pong)
  • state - this can be either "", or "-tmp"
  • file_format - either AOF or RDB (just so that the user knows what kind of data it holds)
    so to be clear, we have:
  • appendonly-1.rdb
  • appendonly-2.aof
  • appendonly-3.aof (when AOFRW succeeds, we rename "3" to "2")
  • appendonly-1-tmp.rdb (when AOFRW succeeds we just drop the -tmp part)
  • appendonly-2-tmp.aof (this one is used in case the previous AOFRW failed, and when renamed, we just drop the -tmp part)

@chenyang8094
Copy link
Collaborator Author

chenyang8094 commented Nov 9, 2021

@chenyang8094 it looks complicated when you paint it like this, but i think that in fact it's just: First load one of the base files:

  • [1] if temp-bg.aof exists, load it!
  • [2] else if appendonly.aof exists, load it!

Then if one of the base files was loaded, proceed to load the other parts:

  • [3] if appendonly.ping exists, load it.
  • [4] if appendonly.pong exists, load it!

we can add an additional action in [2] to set a recovery flag, and add a condition to [4] not to run it if the recovery flag is set, but it seems like this is not needed (since the rename of pong to ping delete it)

I haven't reviewed the code yet (will do so soon), but i assume that when you say "rewrite success" that's another rename operation that renames the temp-rewriteaof-bg-<pid>.aof to temp-bg.aof. if not maybe i'm missing something and my plan above is wrong.

I understand that a meta file can maybe make this a bit clearer, but it's also one more thing to take core of, so if the 1-4 plan above is right, i don't think it's that complicated.

as you pointed out, the other thing to worry about is for users to know which files to copy / keep. all of this is indeed much more complicated than how it used to be till now (just one real file, and a bunch of temp files to ignore). I must say i'm not a fan of all the fancy naming (ping / pong / LESS), and if the files where just numbered (aof1, aof2, aof3) maybe it can be more straight forward to follow. need to think of that a bit more.

another thing that feels odd to me is that in all these examples the appendonly.aof is actually a plain RDB file, so that's a bit misleading / awkward. maybe we can resolve this with a different naming scheme.

as a side note, there have been some discussions in other issues in the repo lately about complications that can occur if the user configured his server to persist to both AOF and RDB files (not knowing what to load and where to take the last repl_offset from), and it was mentioned that once we implement this PR, there's no reason to ever do that anymore (since the AOFRW overhead vanishes). so considering that, it could be a nice idea to call it appendonly-1.rdb (followed by appendonly-2.aof, and appendonly-3.aof)

P.s. there's another state that's not mentioned here, which is a server that starts up with an initial configuration to write to an AOF that didn't exist in the past. in this case, there's no RDB portion, and i suppose that in the current code of the PR, that's the only case where there's only appendonly.aof without even a ping, and that's the only case where the appendonly.aof is actually an AOF (no preamble RDB in it). This may not be a real problem for the purpose of all the discussion above, but it does pose another problem for modules. Some modules choose to store some metadata / configuration in the RDB aux fields, and in all other scenarios, when they recover from an AOF, they have that data, but in this scenario they don't. For this reason, i think it may be a good idea that when the server starts up with an AOF configuration and no AOF file, it'll create an empty AOF file (part one), which is a preamble RDB with no keys. In this case, if follow my terminology, we get appendonly-1.rdb that's always an rdb and is always present.

so bottom line, i'd suggest this scheme: printf("%s-%d%s.%s", server.aof_filename, sequence_number, state, file_format)

  • server.aof_filename - this is the base file name the user provided in the config file
  • sequence_number - 1,2,3 (i.e. base, ping, pong)
  • state - this can be either "", or "-tmp"
  • file_format - either AOF or RDB (just so that the user knows what kind of data it holds)
    so to be clear, we have:
  • appendonly-1.rdb
  • appendonly-2.aof
  • appendonly-3.aof (when AOFRW succeeds, we rename "3" to "2")
  • appendonly-1-tmp.rdb (when AOFRW succeeds we just drop the -tmp part)
  • appendonly-2-tmp.aof (this one is used in case the previous AOFRW failed, and when renamed, we just drop the -tmp part)

Indeed, it may be easier and more intuitive to use serial numbers for file naming, but these are not the core issues at present. I think aof loading are more complicated than you think. For the convenience of discussion, I directly express the core ideas as pseudo-code, because we also have a double write mechanism, so we also need to deal with temp-dw.aof:

void backgroundRewriteDoneHandler() {
   // rename the random name to a fixed name
   if rename(temp-rewriteaof-bg-xxx.aof, temp-bg.aof) == -1 {
       exit()
   }

   if cur_type == TEMP_DW { // in double write mode
       if rename(temp-dw.aof, ping.aof) == -1 {
           exit()
       }
       unlink(pong.aof)
   } else if cur_type == PONG { // in normal rewrite mode
       if rename(pong.aof, ping.aof) == -1 {
           exit()
       }
   } else {
       // no need rename
   }

   if rename(temp-bg.aof, appendonly.aof) == -1 {
       exit()
   }
}


void loadDataFromDisk() {
   //1. load base aof

   if exists temp-bg.aof {
       load temp-bg.aof
   } else if exists appendonly.aof {
       load appendonly.aof
   }

   //2. load incr aof

   if exists temp-bg.aof {
       if exists temp-dw.aof {
           load temp-dw.aof
       } else {
           if exists pong.aof {
               load pong.aof 
           } else if exists ping.aof {
               load ping.aof
           }
       }
   } else {
       if exists ping.aof {
           load ping.aof 
       } 
       
       if exists pong.aof {
           load pong.aof
       }
   }
}

@oranagra
Copy link
Member

oranagra commented Nov 9, 2021

@chenyang8094 can you save me the effort and explicitly specify what problem your pseudo code comes to demonstrate?

@chenyang8094
Copy link
Collaborator Author

chenyang8094 commented Nov 10, 2021

@chenyang8094 can you save me the effort and explicitly specify what problem your pseudo code comes to demonstrate?

First of all, we can see that the above pseudo-code is very complicated. When redis is started, it must check which files currently exist and which files should be loaded. I think this is not a small burden for users (in the face of so many strange files). At the same time, the introduction of double writing (temp-dw.aof) may also be an incomprehensible but sometimes necessary thing, which adds complexity invisibly.
Also, the loadDataFromDisk pseudo-code above cannot cover all cases. For example, temp-bg.aof was left by the last rewrite, and pong.aof(appendonly-2.aof,) was generated this time. If we still load temp-bg.aof And pong.aof(appendonly-2.aof,) is actually wrong.

The complexity of the above scheme is largely because we do not have a meta file to record and manage all our AOF files. In the absence of meta, we need to solve two problems:

  1. How to ensure that the incremental AOF will not increase indefinitely when the rewrite fails repeatedly, because we do not have a meta, so we can only remember a limited and fixed AOF name (such as appendonly-1.aof, appendonly-2.aof, etc.)
  2. How to ensure the atomicity between multiple renames in backgroundRewriteDoneHandler, which is very important for redis recovery when restarting

In order to solve the above two problems, we can have the following corresponding means:

  1. If we rewrite again after the last failed rewrite, we will no longer open appendonly-3.aof as usual, but open a temporary aof (such as temp-dw.aof), and all subsequent executed commands will be Double write to appendonly-3.aof and temp-dw.aof at the same time, so that the aof file can be limited to at most: appendonly-1.aof, appendonly-2.aof and temp-dw.aof
  2. In order to ensure the atomicity of rename (to ensure all success), we can record a redo log in the backgroundRewriteDoneHandler. Even if a crash occurs in the middle of the rename, redis will first perform the redo operation if it finds a redo log in the directory when it restarts, so we can start redis with a certain data set

Here is the pseudo code:

// appendonly.aof: base aof
// appendonly-1.aof: The first increment aof
// appendonly-2.aof: The second increment aof
// temp-dw.aof: Temporary aof for double writing
void backgroundRewriteDoneHandler() {
   // 1、write temp-aof-recovery.redo
   // for example:
   //  rename temp-rewriteaof-bg-1234.aof appendonly.aof
   //  rename temp-dw.aof appendonly-1.aof
   //  unlink(pong.aof)
   if write(temp-aof-recovery.redo, redolog) == -1 {
       exit()
   }

   // 2、rename temp-aof-recovery.redo to aof-recovery.redo
   if rename(temp-aof-recovery.redo, aof-recovery.redo) == -1 {
       exit()
   }

   // 3、rename incr aof
   if cur_type == TEMP_DW { // in double write mode
       if rename(temp-dw.aof, appendonly-1.aof) == -1 {
           exit()
       }
       unlink(pong.aof)
   } else { // in normal rewrite mode
       if rename(appendonly-2.aof, appendonly-1.aof) == -1 {
           exit()
       }
   } 

    // 4、rename base aof
   if rename(temp-rewriteaof-bg-xxx.aof, appendonly.aof) == -1 {
       exit()
   }

    // 5、del redo log
   unlink(aof-recovery.redo)
}


void loadDataFromDisk() {
   // 1、redo rename operations
   if exists aof-recovery.redo {
       redo the rename operations
       if redo failed {
           exit()
       }
       if unlink(aof-recovery.redo) == -1 {
           exit()
       }
   }

   // 2、if we redo succeed, we should have appendonly.aof appendonly-1.aof appendonly-2.aof
    if exists appendonly.aof {
       load appendonly.aof
    }

    if exists appendonly-1.aof {
        load appendonly-1.aof
    } 
    
    if exists appendonly-2.aof {
        load appendonly-2.aof
    }
}

I think the redo-based non-meta scheme is theoretically feasible, but it has two disadvantages:

  1. It is still necessary to handle double write when rewrite fails
  2. It is not conducive to operation and maintenance. In the worst case, there will be these files in the redis directory: appendonly.aof, appendonly-1.aof, appendonly-2.aof, temp-dw.aof, temp-aof-recovery. redo, temp-rewriteaof-bg-xxx.aof, I think this still has a lot of complexity in operation and maintenance

Based on the above, I will explain the scheme with meta below (although my current code is written in accordance with no meta), I think the introduction of a meta will make things a lot easier. Let me make a name convention first:

  1. Meta file: used to record the information of the current AOFs, the name of the meta file is tentatively set to be aof.list (or aof.manifest etc)
  2. Base aof: The naming rule is basename_b_sequence, such as appendonly.aof_b_1, appendonly.aof_b_2, appendonly.aof_b_3, etc.
  3. Incremental aof: The naming rule is basename_i_sequence, such as appendonly.aof_i_1, appendonly.aof_i_2, appendonly.aof_i_3, etc.

In the meta file, we will record the names of all aof files opened by redis and the types of aof. There are three main types: b (base aof), h (history aof), n (new incr aof)

Below I will use a few simple pictures to illustrate the idea of the reform plan:

Suppose the current state is as follows:
there is a base aof: appendonly.aof_b_1 in the current directory, and an incremental aof: appendonly.aof_i_1. At the same time, aof.list records the meta-information of the current AOFs.

image

Next, start to perform a rewrite, there will be two steps at the beginning:

  1. Create and open a new incremental file: appendonly.aof_i_2
  2. Update the aof.list file, add appendonly.aof_i_2 information, note that its type is n at this time

Note: If any of the above two steps fails, the redis process will exit, even if redis crashes between 1 and 2 (such as accidentally killed, etc.), the data is correct when redis restarts, the side effect is that there is an invalid aof file in the directory (appendonly.aof_2).

image

During the rewrite, the executed commands will be written into appendonly.aof_i_2. After the rewrite is over, the child process will generate a temp-rewriteaof-bg-xxx.aof file. In the backgroundRewriteDoneHandler, we need to process the following operations:

  1. rename temp-rewriteaof-bg-xxx.aof to appendonly.aof_b_2
  2. create a temporary Meta file: temp-aof.list, write the following information to it:
    image
  3. rename temp-aof.list to aof.list.
  4. redis cron will periodically clean up the aof files whose type is h in aof.list (including the old base aof and the old incr aof)

Here is the pseudo code:

// current aof.list
// content:
// appendonly.aof_b_1      b              
// appendonly.aof_i_1      n    
// appendonly.aof_i_2      n 
void backgroundRewriteDoneHandler() {
   // 1、rename temp-rewriteaof-bg-xxx.aof to appendonly.aof_b_2
   if rename(temp-rewriteaof-bg-xxx.aof, appendonly.aof_b_2) == -1 {
       exit()
   }

   // 2、write temp-aof.list
   // content:
   // appendonly.aof_b_2      b 
   // appendonly.aof_b_1      h              
   // appendonly.aof_i_1      h    
   // appendonly.aof_i_2      n     
   if write(temp-aof.list, content) == -1 {
       exit()
   }

   // 3、rename temp-aof.list to aof.list
   if rename(temp-aof.list, aof.list) == -1 {
       exit()
   }
}


void loadDataFromDisk() {
   if exists aof.list {
       for aof in aof.list {
           if aof.type != h {
               load aof.name
           }
       }
   } else {
      // Compatible with old versions redis
      load appendonly.aof
   }
}

In summary, when there is a meta file, we get two benefits:

  1. We no longer need double writing after rewrite failed, because we can open new aof happily every time, and we have a meta record of which we opened and their order
  2. We use the atomicity of meta rename to easily solve the recovery problems encountered by non-meta solutions

Comparing the two solutions I just mentioned (with meta and without meta), I prefer the design with meta (although the code I submitted before was based on the design without meta, but I think it is still very imperfect),It will also make redis look more like a modern database in terms of persistence., @oranagra @redis/core-team What do you think? I think we need to quickly determine which solution to adopt, and then I can write code to implement it according to this goal.

@oranagra
Copy link
Member

In summary, when there is a meta file, we get two benefits:

  • We no longer need double writing after rewrite failed, because we can open new aof happily every time, and we have a meta record of which we opened and their order

correct me if i'm wrong, but this advantage or disadvantage it not directly related to meta or not. IIRC we created the double write mechanism in order to make sure that if AOFRW repeatedly fails, we don't create more and more files on the disk (file count remains capped to some 4).
i.e. this means that compared to 6.2, when AOFRW is successful (normal operation), we don't have double write or AOF buffers in ram. and when it fails, we still have double writes, but at least got rid of the RAM buffer.

It seems to me that the meta file design is not really that much simpler than the other one, involving cron and 'h' files.
I think the redo file idea is a nice one and we can proceed with it.
and i think the chance that any of our users will ever see that redo-file or face a state where the renaming process was interrupted by a crash or power failure in the middle are quite slim.
it does complicate redis code, and we'll need a mechanism to test it, but i don't think that actual users will ever face it.

i think that with the naming scheme i described above (printf("%s-%d%s.%s", server.aof_filename, sequence_number, state, file_format)), and either a redo file mechanism or one that can just figure out which renames to re-do when it recovers from a crash, it'll be easy for users to know how to manage their AOF files.

if we agree that both designs are valid, the pseudo code you wrote for either of them doesn't seem too complicated for redis IMHO, and my main concern is how users be able to easily manage these files.

let's wait for more feedback and make a decision.

@soloestoy
Copy link
Collaborator

seems we have two methods to eliminate the double I/O-write and double rewrite-buffer:

  1. using some fixed name files, but file system doesn't support atomically rename two files, need add a redo file to record the status to make sure recover correctly, and still have to double I/O-write if background rewrite fail.
  2. add an meta file to record all files(file name can be arbitrary), don't need double I/O-write any more.

IMHO a meta file is easy to understand since we can have a global information file and lots of databases take this way.

ping @redis/core-team (and maybe other users if you want) need more feedback, our goal is to eliminate double I/O-write and double rewrite-buffer, and at the same time we need make it easy to implement(from redis developer's POV) and maintain AOF files(from redis users' POV).

@chenyang8094
Copy link
Collaborator Author

In summary, when there is a meta file, we get two benefits:

  • We no longer need double writing after rewrite failed, because we can open new aof happily every time, and we have a meta record of which we opened and their order

correct me if i'm wrong, but this advantage or disadvantage it not directly related to meta or not. IIRC we created the double write mechanism in order to make sure that if AOFRW repeatedly fails, we don't create more and more files on the disk (file count remains capped to some 4). i.e. this means that compared to 6.2, when AOFRW is successful (normal operation), we don't have double write or AOF buffers in ram. and when it fails, we still have double writes, but at least got rid of the RAM buffer.

It seems to me that the meta file design is not really that much simpler than the other one, involving cron and 'h' files. I think the redo file idea is a nice one and we can proceed with it. and i think the chance that any of our users will ever see that redo-file or face a state where the renaming process was interrupted by a crash or power failure in the middle are quite slim. it does complicate redis code, and we'll need a mechanism to test it, but i don't think that actual users will ever face it.

i think that with the naming scheme i described above (printf("%s-%d%s.%s", server.aof_filename, sequence_number, state, file_format)), and either a redo file mechanism or one that can just figure out which renames to re-do when it recovers from a crash, it'll be easy for users to know how to manage their AOF files.

if we agree that both designs are valid, the pseudo code you wrote for either of them doesn't seem too complicated for redis IMHO, and my main concern is how users be able to easily manage these files.

let's wait for more feedback and make a decision.

I think the redo-based meta-free solution and the meta-based solution can ultimately achieve our goals, but as far as I know, many companies will back up redis. At present, they generally upload RDB file, but once we divide the AOF into BASE + INCRs, I believe that many users will back up the BASE AOF and INCR AOF(s) one by one (for example, to implement PITR). If we have a meta file, these external systems have a global perspective, and they know which AOF files in the current directory are valid, what is their order. Based on redo's non-meta design, they will be able to see some intermediate state AOF(s) and redo logs. They cannot execute redo at restart to reach a consistent state like redis, so for them, they can only back up some the intermediate state of AOF(s)(a bunch of files starting with temp-), which is obviously unfriendly to them.

@oranagra I think it is worthwhile to bring a clearer state at the cost of adding a meta file. Of course, this is just my idea. I am eager to get feedback from core-team and other interested parties.

@oranagra
Copy link
Member

I had a lengthy discussion about the alternatives and future plans with @yossigo and @yoav-steinberg, we concluded that it would indeed be better to take the meta-file approach, mainly since it'll allow more flexibility in the future.

features like keeping historical AOF files for longer so a user can restore the db to an older state (before the last rewrite), this is in line with #9326.
to be clear, i don't want to do that now, but just stating that it will allow such flexibility.

I still think it would be a bad idea to create infinite number of files on the disk if AOFRW is repeatedly failing, so i still think we should set the threshold of switching to the double write solution after just one failure (don't see a reason to set that threshold to 2 or 5).

Other random notes from the discussion:

for this PR

  • on upgrades from old redis versions, we need to detect the old AOF file and rename it / create a meta file.
  • I think we should keep the config changes here to minimum, and consider the current (old) appenfilename config as the base file name (maybe trim the .aof suffix), and use that base file name for all the files we create (base part, incremental parts, temp parts, and metadata file).
  • There was a discussion about whether or not this new feature (multi-part AOF) should be configurable, so that admins or platforms that wanna upgrade to redis 7 and enjoy other features it offers, but can't be bothered to do the extra work to adjust their things to the new way, will have a way to disable it.
    we think this will be too complex for redis internals (and the developers state of mind) to keep both of these side by side, and concluded we're probably ok to completely forget the old way.

for a followup PR:

  • i think we can / should now (in a followup PR maybe) deprecate / delete the aof-use-rdb-preamble
  • i think we should now (in a followup PR) delete the entire mechanism of rewriteAppendOnlyFileRio (e.g. rewriteStreamObject etc.), and deprecate the aof_rewrite module API callback (which causes many modules hard time to implement).
  • i think that when redis starts up configured to persist to AOF, and there's no file on the disk, it should create an empty base (rdb) file (using a foreground / blocking rdbSave), which will also let modules put their aux data in it (so module that are recovering from persistence will always have their aux data available, possibly containing the old config)

For some future version:

  • in some future day (possibly only in redis 7.2), we'll want to completely consolidate the BGSAVE rdb snapshots with AOFRW, so that any bgsave (even one that was created for a replica), can serve as an AOFRW too.

@redis/core-team please respond if you wanna argue about any of the above.

@chenyang8094
Copy link
Collaborator Author

I had a lengthy discussion about the alternatives and future plans with @yossigo and @yoav-steinberg, we concluded that it would indeed be better to take the meta-file approach, mainly since it'll allow more flexibility in the future.

features like keeping historical AOF files for longer so a user can restore the db to an older state (before the last rewrite), this is in line with #9326. to be clear, i don't want to do that now, but just stating that it will allow such flexibility.

I still think it would be a bad idea to create infinite number of files on the disk if AOFRW is repeatedly failing, so i still think we should set the threshold of switching to the double write solution after just one failure (don't see a reason to set that threshold to 2 or 5).

Other random notes from the discussion:

for this PR

  • on upgrades from old redis versions, we need to detect the old AOF file and rename it / create a meta file.
  • I think we should keep the config changes here to minimum, and consider the current (old) appenfilename config as the base file name (maybe trim the .aof suffix), and use that base file name for all the files we create (base part, incremental parts, temp parts, and metadata file).
  • There was a discussion about whether or not this new feature (multi-part AOF) should be configurable, so that admins or platforms that wanna upgrade to redis 7 and enjoy other features it offers, but can't be bothered to do the extra work to adjust their things to the new way, will have a way to disable it.
    we think this will be too complex for redis internals (and the developers state of mind) to keep both of these side by side, and concluded we're probably ok to completely forget the old way.

for a followup PR:

  • i think we can / should now (in a followup PR maybe) deprecate / delete the aof-use-rdb-preamble
  • i think we should now (in a followup PR) delete the entire mechanism of rewriteAppendOnlyFileRio (e.g. rewriteStreamObject etc.), and deprecate the aof_rewrite module API callback (which causes many modules hard time to implement).
  • i think that when redis starts up configured to persist to AOF, and there's no file on the disk, it should create an empty base (rdb) file (using a foreground / blocking rdbSave), which will also let modules put their aux data in it (so module that are recovering from persistence will always have their aux data available, possibly containing the old config)

For some future version:

  • in some future day (possibly only in redis 7.2), we'll want to completely consolidate the BGSAVE rdb snapshots with AOFRW, so that any bgsave (even one that was created for a replica), can serve as an AOFRW too.

@redis/core-team please respond if you wanna argue about any of the above.

I very much agree with the several conclusions you made about this PR(1、can upgrades from old redis versions. 2、minimum the config changes. 3、completely forget the old way,etc.). This is also the key point of my previous thinking.

I think another point is that the content of the aof meta file needs to be well designed (even some redundant fields are reserved) , so that we can obtain better flexibility when implementing other functions (such as PITR) in the future. I will implement a meta-based version as soon as possible so that we can discuss more detailed issues.

@oranagra
Copy link
Member

Closing in favor of #9788

@oranagra oranagra closed this Nov 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants