
PSYNC2: make partial sync possible after master reboot #8015

Merged: 7 commits into redis:unstable, Sep 13, 2021

Conversation

@soloestoy (Collaborator) commented Nov 4, 2020

This PR aims to solve issue #6030, about making psync possible after a master reboot. I initially wrote some code in #6034, but it wasn't good enough; after a discussion with @oranagra I reworked the code (around the expire problem) and opened this new PR. Ping @redis/core-team, I'd like to have your suggestions.

The main idea is to allow a master to load replication info from the RDB file when rebooting. If the master can load the replication info, replicas may have a chance to psync with it, which can save a lot of traffic.

The key point is that we need to guarantee safety and consistency, so there
are two differences between master and replica:

  1. The master loads the replication info as its secondary ID and
     offset, in case another master has the same replid.
  2. While the master is loading the RDB, it propagates expired keys as DEL
     commands to the replication backlog, so replicas can receive these
     commands and delete stale keys.
     p.s. the number of keys expired during RDB loading is useful for users, so
     we expose it as rdb_last_load_keys_expired and rdb_last_load_keys_loaded in INFO persistence.

Moreover, after loading the replication info, the master should update
no_replica_time, in case loading the RDB takes a long time.

@oranagra (Member) left a comment


@soloestoy i think we better have a test for that.
@guybe7 please have a look.

(adding major decision due to the new info field)

@oranagra added the state:major-decision (Requires core team consensus) label Nov 4, 2020
@trevor211 (Collaborator)

I think this PR tries to solve partial sync when the master reboots before its replicas do a failover.
While it is safe to save rsi.repl_id to server.replid2, a full sync is inevitable if one of its replicas gets promoted.
If we save rsi.repl_id to server.replid as before, it is possible to trigger a partial sync.
Neither option is perfect, but we need a decision.

@ShooterIT (Collaborator) commented Dec 17, 2020

Hi @oranagra @soloestoy, I think that's a great idea, especially since we changed the 'random sampling garbage collector' to a 'consecutive garbage collector' in activeExpireCycle. There is a good chance of achieving a partial resynchronization if the restart time is shorter than the TTL of all keys. I think we should push it forward.

I think what Oran said makes sense; maybe we can refactor the master behavior. I notice we don't create a master replid.

```c
/* If no expired keys were deleted while loading the RDB, we may have a
 * chance to restore the replication context and allow partial
 * resynchronizations. */
if (!server.rdb_expired_keys_last_load) {
    memcpy(server.replid,rsi.repl_id,sizeof(server.replid));
    server.master_repl_offset = rsi.repl_offset;
    if (!iAmMaster()) {
        /* If this is a replica, create a cached master from this
         * information, in order to allow partial resynchronizations
         * with masters, e.g. after REPLICAOF ip port. */
        replicationCacheMasterUsingMyself();
        selectDb(server.cached_master,rsi.repl_stream_db);
    } else {
        /* If this is a master, shift the current replication ID and
         * offset to the secondary replication ID, to allow replicas to
         * partially resynchronize, e.g. after REPLICAOF NO ONE. */
        shiftReplicationId();
        createReplicationBacklog();
        server.repl_no_slaves_since = time(NULL);
    }
}
```

Just FYI, no test :)

@ShooterIT (Collaborator)

To avoid expired keys invalidating the replication context, could we add a config delay-expire-after-start?
In most cases an HA component (such as a config server, Sentinel, or the cluster mechanism) will finish the failover while the master restarts, so it is quite likely that after restarting, the master will later be demoted to a slave role. We could delay expiring keys to keep the replication context valid and allow partial resynchronization.

@oranagra (Member)

@soloestoy seems we have forgotten this PR although i don't see any major issues that were discussed. let's pick it up again and finish it for 7.0?

Regarding the last comment by @ShooterIT, i don't like to add that config, but maybe we can write to the replication backlog when we're expiring keys during rdb load, this way if there weren't too many expired keys, replicas will still be able to psync.

@oranagra oranagra added this to Backlog in 7.0 via automation Jul 22, 2021
@oranagra oranagra moved this from Backlog to To Do in 7.0 Jul 22, 2021
@soloestoy (Collaborator, Author)

@oranagra I will rebase this PR.

> but maybe we can write to the replication backlog when we're expiring keys during rdb load, this way if there weren't too many expired keys, replicas will still be able to psync.

I like this approach.

@soloestoy (Collaborator, Author)

> but maybe we can write to the replication backlog when we're expiring keys during rdb load, this way if there weren't too many expired keys, replicas will still be able to psync.

@oranagra Sadly, this cannot work, because a PING command received by the replica may disturb the offset, leading to inconsistency between the replica and the rebooted master.

If we can take PING out of the replication stream (I did that in PR #8440), this approach can work.

@oranagra (Member)

the server just restarted, if there are any slaves already connected, they won't be in an ONLINE state yet.
so i think we can change replicationCron to avoid polluting the backlog at all.
but also note that PSYNC is not an ok-loading command, so i don't think we'll have any replicas at all, i think the code is already ok in that respect...
am i missing something? if so, i think it is probably solvable (without a generic solution for separating the PINGs from the repl-stream)

@soloestoy force-pushed the psync2-master-load-replinfo branch from 9e82c2f to 8b5bab2 on July 22, 2021 09:43
@soloestoy (Collaborator, Author) commented Jul 22, 2021

@oranagra I mean the PING command before restart:

Time 1. master offset is 100 with no PING sent, and it generates an RDB with offset 100
Time 2. master sends PING to the replica; the offset grows to 114
Time 3. master restarts from the RDB, so its initial offset is 100
Time 4. master deletes stale keys and writes to the replication backlog during RDB loading, and the offset may grow beyond 114
Time 5. after the master has restarted, the replica uses PSYNC repl_id 114 to connect; it can avoid a FULLRESYNC, but the data is wrong (the stream from offset 100 to 114 is lost).

@oranagra (Member)

@soloestoy that's right, but what's the difference between that case and the case in which we don't write DEL commands on behalf of expired keys? i.e. a master restarts from an rdb file, then receives a bunch of commands from clients before the replica attempts to PSYNC.
i suppose the solution for that is that we shiftReplicationId before putting things into the backlog, so we should do that before adding the DEL commands too, and then the problem you described is solved.

besides, if i understand correctly, the main issue that this PR comes to solve is a graceful restart of the master (one that saves an rdb file on shutdown), so in that case the problem you described doesn't exist.
this means that we could add an AUX hint to the rdb saying that it's an rdb file from a graceful shutdown, and only resurrect the replication id in that case (but as noted above, i don't think that's necessary).

p.s. another improvement to that problem, and possibly to other problems, is that when the server goes down gracefully, it stores the replication backlog into the rdb file (along with both replid1 and replid2 and their offsets).
then on a restart, replicas are able to sync even if they were dropped before the graceful shutdown.
If we take this further, we may even want to store the replication backlog in the rdb file we send to replicas, so a replica that just synced and immediately afterwards gets promoted can serve PSYNC from other replicas that have older replids.

@soloestoy (Collaborator, Author)

@oranagra thanks, I got it.

@eduardobr (Contributor)

If both AOF and RDB are enabled and a replica (or a master, in the case of the new implementation) restarts, should it ignore the replication metadata because it loads the AOF and not the RDB? That's what I've noticed. Maybe a separate issue?

@soloestoy (Collaborator, Author)

@eduardobr we have never supported partial resync after a restart from AOF; only a restart from RDB has that chance.

@eduardobr (Contributor)

@soloestoy Do you know if there are limitations that require us to store the metadata in the .rdb only?
For example, why can't redis have a separate file just for metadata, and use it regardless of whether it loads the rdb or the aof?

@oranagra (Member)

> If both AOF and RDB are enabled and a replica (or a master, in the case of the new implementation) restarts, should it ignore the replication metadata because it loads the AOF and not the RDB? That's what I've noticed. Maybe a separate issue?

Maybe when redis saves an RDB file during shutdown it should delete the AOF? and when it starts, copy the RDB file to an AOF (serving as preamble AOF).
This would not only solve the partial sync problem, but also make a much faster load time on startup (loading AOF can take very long).
On the other hand, if an unexpected crash happens, it would load the AOF file on startup.

@yossigo @yoav-steinberg @soloestoy @madolson WDYT?

@oranagra oranagra moved this from To Do to In progress in 7.0 Aug 16, 2021
@soloestoy (Collaborator, Author)

PR updated, please check @oranagra

@soloestoy (Collaborator, Author)

> Maybe when redis saves an RDB file during shutdown it should delete the AOF? and when it starts, copy the RDB file to an AOF (serving as preamble AOF).
> This would not only solve the partial sync problem, but also make a much faster load time on startup (loading AOF can take very long).
> On the other hand, if an unexpected crash happens, it would load the AOF file on startup.

I don't think it's a good time to do it, since the multi-part AOF may change a lot.

@oranagra oranagra moved this from In progress to In Review in 7.0 Aug 30, 2021
@yoav-steinberg (Contributor)

> Maybe when redis saves an RDB file during shutdown it should delete the AOF? and when it starts, copy the RDB file to an AOF (serving as preamble AOF).
> This would not only solve the partial sync problem, but also make a much faster load time on startup (loading AOF can take very long).
> On the other hand, if an unexpected crash happens, it would load the AOF file on startup.

The idea sounds good. But it raises another question: if we use RDB preambles in our AOF and RDB is also enabled, then each RDB save can be treated as an AOF rewrite. The RDB save during shutdown is just a simple case of an AOF rewrite without buffering.

@oranagra (Member) commented Sep 5, 2021

> Maybe when redis saves an RDB file during shutdown it should delete the AOF? and when it starts, copy the RDB file to an AOF (serving as preamble AOF).
> This would not only solve the partial sync problem, but also make a much faster load time on startup (loading AOF can take very long).
> On the other hand, if an unexpected crash happens, it would load the AOF file on startup.

I discussed it with Yossi, and realized that just deleting the AOF on shutdown is not good, since on startup, if we start without an AOF file, we'll have to do an initial rewrite or rename the RDB file so that we can accumulate writes into it.
On the other hand, renaming the RDB file we generate on shutdown to an AOF isn't that simple either, because when we load it on startup we won't look at its replid+offset for doing a PSYNC (this PR does that only for rdb files), and it also won't have an aof-preamble aux field.

bottom line, i agree we want to merge this PR as is, and re-open this discussion after multi-part AOF is implemented (or when working on it).
The things we'll want to consider then are:

  • faster startups when AOF and RDB are both enabled (no need to parse a huge AOF file)
  • ability to do a successful PSYNC in such a case
  • in a world of preamble AOF and multipart AOF, maybe there's no need to ever have them both enabled at the same time; each bgsave / persistent snapshot can also serve as an AOFRW (no AOF buffering in RAM)

@oranagra added the release-notes and state:to-be-merged labels Sep 5, 2021
@soloestoy (Collaborator, Author) commented Sep 7, 2021

Maybe an easy approach can work: when aof-use-rdb-preamble is yes, put the commands in the AOF into the replication backlog after loading the replication info from the preamble RDB.

@oranagra (Member) commented Sep 7, 2021

> Maybe an easy approach can work: when aof-use-rdb-preamble is yes, put the commands in the AOF into the replication backlog after loading the replication info from the preamble RDB.

I don't like it. I'm mainly aiming for restart after a graceful shutdown; in that case we can store some metadata in the tail of the aof, but I'd also like to avoid loading a huge aof file on startup.
P.s. We do flush the replica buffers before shutting down, but if that's not enough, then in addition to loading data from the aof into the backlog, we should also store the backlog in the rdb for restart.

Anyway, since there's no trivial solution I agree we should revisit that after the multipart aof change.

@yoav-steinberg (Contributor)

> • in a world of preamble AOF and multipart AOF, maybe there's no need to ever have them both enabled at the same time; each bgsave / persistent snapshot can also serve as an AOFRW (no AOF buffering in RAM)

^ 👍

@soloestoy soloestoy merged commit 794442b into redis:unstable Sep 13, 2021
@sundb (Collaborator) commented Sep 18, 2021

Related #9513.
Lately, the diskless replication read pipe cleanup test has been failing every day: https://github.com/redis/redis/runs/3593238398?check_suite_focus=true
After reverting this PR, I ran the test with --loop and it no longer failed.

@oranagra (Member)

@sundb this should be fixed by #9513
Seems to be an unrelated bug in the test that probably got exposed by some additional log print or timing change.

Successfully merging this pull request may close these issues.

Redis-cluster: After restart master of cluster,it just can get a full Asynchronous Replication