
Make it possible for a master to restart from AOF and still be able to serve PSYNC #9796

Open · oranagra opened this issue Nov 17, 2021 · 13 comments

@oranagra (Member) commented Nov 17, 2021

We want to make it possible for a master to restart from AOF and still be able to serve PSYNC.
See #8015.

In theory, we could either check whether the offset in the preamble RDB header has a chance to PSYNC before loading the RDB and AOF tail (see #10523), and then skip the AOF tail and PSYNC instead.
Or we could save (maybe periodically) replication offsets in the AOF file.
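
To illustrate the first option: a rough Python sketch (not Redis internals; every helper here is a hypothetical placeholder), relying on the `repl-id` / `repl-offset` aux fields that the RDB preamble already carries:

```python
def restart_from_aof(base_rdb, aof_tail, master):
    # Peek at the replication id/offset stored in the RDB preamble's
    # aux fields before deciding how to bring the dataset up to date.
    replid, offset = read_rdb_aux_fields(base_rdb, "repl-id", "repl-offset")
    load_rdb(base_rdb)
    if replid is not None and master.psync(replid, offset) == "+CONTINUE":
        # The offset is still within the master's backlog: stream the
        # delta over the wire and skip replaying the AOF tail entirely.
        return
    # No chance to partially sync: replay the tail locally, as today.
    load_aof_tail(aof_tail)
```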

@eduardobr (Contributor) commented:

@oranagra
Is there a vision of what the implementation might look like?
Is the new manifest file only there to help manage files?

Sometimes I think that an AOF could carry metadata just like any other command, as long as this metadata is only useful after the AOF is loaded; the replication id is one example. You could even write it many times, with only the latest taken into consideration, as if we had an internal metadata store and you could "SETMETA key value" in the AOF.

@oranagra (Member Author) commented:

No concrete design yet, but what you suggest could someday be done using annotations:
#9325
#9282
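
For illustration: annotations are AOF lines starting with `#` that the loader skips. A replication-metadata annotation could hypothetically ride on that same mechanism; in the fragment below, the `#TS:` timestamp is the existing form, while the `#REPL:` line is invented for this example:

```
#TS:1662300000
#REPL:repl-id 3fd9a7b2... repl-offs 123456
*3
$3
SET
$3
foo
$3
bar
```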

@eduardobr (Contributor) commented Sep 4, 2022

Considering that an AOF without a tail/incr is just like an RDB, we could support this feature by reusing the logic for RDB.

This also solves #10523.

Suggestion:

  • When AOF is on and RDB is off, do a blocking AOFRW on shutdown, respecting the shutdown flags. It's the same as a SAVE, but ends up as a base AOF (RDB) instead of dump.rdb.
  • On startup, if AOF is on and there is no AOF tail/incr (meaning a graceful shutdown), restore the same way as from an RDB, and get a possible partial sync (see the sketch after this list).
  • If there's an AOF tail, then it wasn't a graceful shutdown, so do nothing different for now, but improve it at a later point.
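
A minimal sketch of that startup branch, assuming the multi-part AOF manifest tells us whether an incr tail exists (all function names are invented for illustration):

```python
def load_on_startup(manifest):
    load_rdb(manifest.base_file)  # the base AOF is RDB-encoded
    if not manifest.incr_files:
        # No tail: the previous shutdown ended with a rewrite, so the
        # replid/offset stored in the base are current. Restore them
        # exactly as an RDB restart would, enabling partial resync.
        restore_replication_info(manifest.base_file)
    else:
        # A tail exists (non-graceful shutdown): replay it as today,
        # without PSYNC support for now.
        replay_incr_files(manifest.incr_files)
```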

Now, a blocking SAVE/AOFRW on shutdown is really bad, as it leaves clients timing out and not knowing what to do (the main reason I added shutdown-on-sigterm/int), so we would be in a better position if that were improved first.
It would be better if Redis at least returned a proper status while shutdown is in progress, instead of blocking.
A blocking shutdown can cause any platform probing the Redis instance to think it's dead or unhealthy and try to force-kill it.

An alternative to all of this: store the replication information in some kind of specialized file.

EDIT:
@oranagra, in #11076 you mention:

We could even make that SHUTDOWN SAVE a non-blocking one (in the spirit of #9872, we can initiate an AOFRW, keep processing traffic, even write commands, and then terminate when the rewrite is completed).

Maybe that's the answer to the blocking-shutdown issue, but I don't really understand how it works.

@oranagra (Member Author) commented Sep 5, 2022

@eduardobr your suggestion of doing an AOFRW on shutdown and then detecting that it has no incremental part in order to do a PSYNC seems valid, but maybe we need to let the user choose whether it should be enabled.
In some cases the incremental part of the AOF file can be huge, and it makes sense to spend a little time on shutdown in order to load faster on startup; in other cases it doesn't. It depends on how long the tail is; maybe we can add a configurable threshold that by default makes the right decision.

Regarding non-blocking shutdown, the idea was to do an AOFRW during termination, keep serving writes, and store them into an incremental file, so at the end we have a relatively small incremental part.
If we want that to also serve PSYNC, we need to record the replication offset somewhere.
This can be done either periodically in the incremental part of the AOF (to also help after crashes), or maybe as just a single offset in the manifest file in the case of a graceful shutdown.
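
To make that concrete, here is a hedged sketch of such a shutdown path; the `shutdown-aofrw-min-tail-size` config name and all helpers are hypothetical, not existing Redis names:

```python
def on_graceful_shutdown(server):
    tail_size = total_incr_size(server.aof_manifest)
    if tail_size >= server.config["shutdown-aofrw-min-tail-size"]:
        # Long tail: spend a little time now so the next startup is fast.
        do_aof_rewrite(server)
    else:
        # Short tail: just flush it; replaying it on startup is cheap.
        flush_aof(server)
    # Either way, record where replication stood, so a restarted master
    # can answer PSYNC with +CONTINUE instead of forcing a full sync.
    record_repl_hint(server.aof_manifest,
                     repl_id=server.replid,
                     repl_ofs=server.master_repl_offset)
```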

@eduardobr (Contributor) commented:

I'm afraid that trying PSYNC after a crash could lead to complications; better to go for a full sync in those cases and be safe?
I just looked at the manifest file format, and the fact that it declares the type of information as the first "column" is great. We could store the offset and replication id there.
What's the consequence in case of a crash, where we would reload some old offset info?
If using the manifest, should the replid and offset be removed on startup and saved on shutdown?

About letting the user choose whether AOFRW should be enabled on shutdown: could it be the shutdown-on-sigint/term save option? So if only AOF is enabled, do a BGREWRITEAOF on shutdown; if only RDB is enabled, do what it does today.
That's breaking, but what's the use case for AOF plus creating an RDB on shutdown?

@oranagra (Member Author) commented Sep 5, 2022

Regarding the manifest: maybe we can add a hint that includes both the replication offset / id and the size of the incremental file it corresponds to, so the moment we append anything to the incremental file, that data in the manifest becomes invalid, without even requiring us to trim it from there.
I'm not certain whether this should be in a column, or maybe some other special notation (e.g. a line that starts with # or something).
@chenyang8094 please advise.

Regarding AOFRW on shutdown, I'd rather not change the meaning of existing interfaces, and I'm also not certain users should choose this behavior explicitly.
I think we can attempt to make Redis do the right thing by default: concluding on shutdown that the AOF file is too long, and implicitly deciding to rewrite it (unless maybe the NOSAVE or FORCE arguments are present).

That's breaking, but what's the use case for AOF + creating RDB on shutdown?

I think there is a use case, which is a faster restart (from RDB), enabling AOF later.
It depends on whether we want to do this as a small incremental improvement, or tackle the bigger thing, which is to completely unify RDB snapshots and AOF (#9795); that requires a bit more planning.

@eduardobr (Contributor) commented Sep 5, 2022

The idea of invalidating the offset as soon as there's another append sounds good.
But doing an AOFRW only if the file is too long seems to me an option that will almost never enable PSYNC in a lot of setups; a conscious user will already have set the percentage threshold to keep the AOF compact.
Maybe an AOFSAVE flag, to be totally non-breaking and opt-in? (While NOSAVE continues to be the option to disable any kind of save on shutdown.)

I think there is a use case, which is faster restart (from RDB), and enabling AOF later.

Oh, but that's what we also get with the AOFRW.

But if this is added to 7.2, isn't it acceptable to change the behavior and always AOFRW by default on shutdown (with NOSAVE as the opt-out)? It's also one step closer to AOF and RDB consolidation, because for RDB we already always save on shutdown; one of them would have to break in order to consolidate.

@oranagra (Member Author) commented Sep 5, 2022

Regarding the first topic, I think we can add the replication offset notation to the manifest on any normal shutdown with AOF (not only if we do an AOFRW during shutdown), so it'll work either way.

Regarding the other comments, I'll try to respond later; I'm currently out of time.

@oranagra (Member Author) commented Sep 6, 2022

Oh, but that's what we also get on the AOFRW

Yes, but maybe that's a breaking change: maybe someone does SHUTDOWN SAVE, has some script that copies the RDB file to another machine, and starts Redis there.
I'd rather not change the meaning of the SAVE flag, unless it comes together with the much bigger topic of consolidating snapshots and AOF into one.

@chenyang8094 (Collaborator) commented Sep 9, 2022

@oranagra Do we need to record the offset and replid for each AOF file (base and all incrs)? IIUC, we just need to record them for the base AOF (RDB). If so, I think it's better to do it with # (recorded at the very beginning of the manifest). Otherwise, it would be better to do it in columns if it's required for every AOF file.

@oranagra (Member Author) commented:

I think we just need to record the replication offset and id of the tail of the last incr file (and its size).
So it could be:

# last-size <last incr file size> repl-id <repl id> repl-ofs <repl offset>

Then, when we generate the manifest file again (inserting a new incr or base file), we delete that record (which is why it doesn't have to mention which incr file it refers to).

When loading it, we ignore the replication data if the last incr file is not at the right size.
This mechanism can work both with an AOFRW at shutdown and with just an AOF flush at shutdown.
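
As an illustration, a minimal parser/validator for such a hint line could look like the sketch below; the function and its arguments are made up for this example, and real manifest handling would be more involved:

```python
import os

def read_repl_hint(hint_line, last_incr_path):
    """Parse '# last-size <n> repl-id <id> repl-ofs <ofs>' and return
    (repl_id, repl_ofs), or None if the hint is stale."""
    fields = hint_line.lstrip("# ").split()
    rec = dict(zip(fields[0::2], fields[1::2]))
    if os.path.getsize(last_incr_path) != int(rec["last-size"]):
        # Something was appended to the last incr file after the hint
        # was written; the recorded offset no longer matches, ignore it.
        return None
    return rec["repl-id"], int(rec["repl-ofs"])
```

The size check is what makes explicit trimming unnecessary: any append to the last incr file silently invalidates the hint.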

@eduardobr (Contributor) commented:

@oranagra, hi years later =)
I remember I drafted your last comment and made it work (it just needed proper validation when loading the manifest, and tests to be written); it would be a waste to never bring it back.

I would like to know how welcome a change like this currently is, and whether the AOF and RDB consolidation project has since moved forward as a better approach.
Cheers

@oranagra (Member Author) commented:

@eduardobr I'm currently focused elsewhere, so a big project of promoting the consolidation of AOF and RDB isn't on my table, and AFAIR we haven't pushed this forward since the multi-part AOF work in 7.0.
If it is a small incremental change that's isolated from other things, we can consider it for 8.0.

I suppose trying to PSYNC from the offset we have at the base (and then, if it fails, considering loading the AOF tail) is not such a change.
But maybe writing an offset at graceful termination could be one, i.e. either to the manifest, or as a comment at the end of the last incr file.
I can't afford to deep-dive into it right now, but if it makes sense to you and you conclude it's straightforward and doesn't have any downsides compared to what we have today, you can proceed.
