
PSYNC2: make partial sync possible after master reboot #8015

Merged: 7 commits into redis:unstable, Sep 13, 2021

Conversation

@soloestoy (Collaborator) commented Nov 4, 2020

This PR aims to solve issue #6030, about making psync possible after a master reboot. I initially wrote some code in #6034, but it wasn't good enough; after a discussion with @oranagra I reworked the code (around the expire problem) and opened this new PR. Ping @redis/core-team, I'd like to have your suggestions.

The main idea is to allow a master to load replication info from the RDB file when rebooting. If the master can load the replication info, replicas may have a chance to psync with it, which can save a lot of traffic.

The key point is that we need to guarantee safety and consistency, so there
are two differences between master and replica:

  1. The master loads the replication info as its secondary ID and
     offset, in case another master has the same replid.
  2. While the master is loading the RDB, it propagates expired keys as DEL
     commands to the replication backlog, so replicas can receive these
     commands and delete stale keys.
     p.s. the number of keys expired during RDB loading is useful for users, so
     we expose it as rdb_last_load_keys_expired and rdb_last_load_keys_loaded in INFO persistence.

Moreover, after loading the replication info, the master should update
no_replica_time, in case loading the RDB takes a long time.

@oranagra (Member) left a comment


@soloestoy i think we better have a test for that.
@guybe7 please have a look.

(adding major decision due to the new info field)

@oranagra added the state:major-decision (Requires core team consensus) label Nov 4, 2020
@trevor211 (Collaborator)

I think this PR tries to solve partial sync when the master reboots before its replicas do a failover.
While it is safe to save rsi.repl_id to server.replid2, a full sync is inevitable if one of its replicas gets promoted.
If we save rsi.repl_id to server.replid as before, it is possible to trigger a partial sync.
Neither option is perfect, but we need a decision.

@ShooterIT (Collaborator) commented Dec 17, 2020

Hi @oranagra @soloestoy, I think that's a great idea, especially since we changed the 'random sampling garbage collector' to a 'consecutive garbage collector' in activeExpireCycle. There is a good chance of achieving a partial resynchronization if the restart time is shorter than the TTL of all keys. I think we should push it forward.

I think what Oran said makes sense; maybe we can refactor the master behavior. I notice we don't create a master replid.

```c
/* If no expired keys were deleted while loading the RDB, we may have a
 * chance to restore the replication context and allow partial
 * resynchronizations. */
if (!server.rdb_expired_keys_last_load) {
    memcpy(server.replid,rsi.repl_id,sizeof(server.replid));
    server.master_repl_offset = rsi.repl_offset;
    if (!iAmMaster()) {
        /* If this is a replica, create a cached master from this
         * information, in order to allow partial resynchronizations
         * with masters, e.g. after REPLICAOF ip port. */
        replicationCacheMasterUsingMyself();
        selectDb(server.cached_master,rsi.repl_stream_db);
    } else {
        /* If this is a master, shift the current replication ID and
         * offset to the secondary replication ID, to allow replicas to
         * partially resynchronize, e.g. after REPLICAOF NO ONE. */
        shiftReplicationId();
        createReplicationBacklog();
        server.repl_no_slaves_since = time(NULL);
    }
}
```

Just FYI, no test :)

@ShooterIT (Collaborator)

To avoid expired keys invalidating the replication context, could we add a config delay-expire-after-start?
In most cases an HA component (such as a config server, Sentinel, or the cluster mechanism) will finish the failover while the master restarts, so it is quite likely that after restarting, the master will later be demoted to a slave role. We could delay expiring keys to keep the replication context valid and allow partial resynchronization.

@oranagra (Member)

@soloestoy seems we have forgotten this PR although i don't see any major issues that were discussed. let's pick it up again and finish it for 7.0?

Regarding the last comment by @ShooterIT, i don't like to add that config, but maybe we can write to the replication backlog when we're expiring keys during rdb load, this way if there weren't too many expired keys, replicas will still be able to psync.

@oranagra oranagra added this to Backlog in 7.0 via automation Jul 22, 2021
@oranagra oranagra moved this from Backlog to To Do in 7.0 Jul 22, 2021
@soloestoy (Collaborator, Author)

@oranagra I will rebase this PR.

> but maybe we can write to the replication backlog when we're expiring keys during rdb load, this way if there weren't too many expired keys, replicas will still be able to psync.

I like this approach.

@soloestoy (Collaborator, Author)

> but maybe we can write to the replication backlog when we're expiring keys during rdb load, this way if there weren't too many expired keys, replicas will still be able to psync.

@oranagra Sadly, this cannot work, because a PING command received by the replica may disturb the offset, leading to inconsistency between the replica and the rebooted master.

If we can take PING out of the replication stream (I did that in PR #8440), this approach can work.

@oranagra (Member)

the server just restarted, if there are any slaves already connected, they won't be in an ONLINE state yet.
so i think we can change replicationCron to avoid polluting the backlog at all.
but also note that PSYNC is not an ok-loading command, so i don't think we'll have any replicas at all, i think the code is already ok in that respect...
am i missing something? if so, i think it is probably solvable (without a generic solution for separating the PINGs from the repl-stream)

@soloestoy force-pushed the psync2-master-load-replinfo branch from 9e82c2f to 8b5bab2 on July 22, 2021 09:43
@soloestoy (Collaborator, Author) commented Jul 22, 2021

@oranagra I mean the PING command before restart:

Time 1. master offset is 100 with no PING sent, and it generates an RDB with offset 100
Time 2. master sends PING to the replica; the offset grows to 114
Time 3. master restarts from the RDB, so its initial offset is 100
Time 4. master deletes stale keys and writes to the replication backlog during RDB loading, and the offset may grow beyond 114
Time 5. after the master has restarted, the replica uses PSYNC repl_id 114 to connect; it can avoid a FULLRESYNC, but the data is wrong (the stream from offset 100 to 114 is lost).

@oranagra (Member)

@soloestoy that's right, but what's the difference between that case and the case in which we don't write DEL commands on behalf of expired keys? i.e. a master restarts from an rdb file, then receives a bunch of commands from clients before the replica attempts to PSYNC.
i suppose the solution for that is that we shiftReplicationId before putting things into the backlog, so we should do that before adding the DEL commands too, and then the problem you described is solved.

besides, if i understand correctly, the main issue that this PR comes to solve is a graceful restart of the master (one that saves an rdb file on shutdown), so in that case the problem you described doesn't exist.
this means that we could add an AUX hint to the rdb saying that it's an rdb file from a graceful shutdown, and only resurrect the replication id in that case (but as noted above, i don't think that's necessary).

p.s. another improvement to that problem, and possibly to other problems, is that when the server goes down gracefully, it stores the replication backlog into the rdb file (along with both replid1 and replid2 and their offsets).
then on a restart, replicas are able to sync even if they were dropped before the graceful shutdown.
If we take this further, we may even want to store the replication backlog in the rdb file we send to replicas, so a replica that just synced and immediately afterwards gets promoted can serve PSYNC from other replicas that have older replids.

@soloestoy (Collaborator, Author)

@oranagra thanks, I got it.

@eduardobr (Contributor)

If both AOF and RDB are enabled and a replica (or a master, in the case of the new implementation) restarts, should it ignore the replication metadata because it loads the AOF and not the RDB? That's what I've noticed. Maybe a separate issue?

@soloestoy (Collaborator, Author)

@eduardobr we have never supported partial resync after a restart from AOF; only a restart from RDB has that chance.

@eduardobr (Contributor)

@soloestoy Do you know if there are limitations that require us to store the metadata in the .rdb only?
For example, why can't redis have a separate file just for metadata, and use it regardless of whether it loads the rdb or the aof?

@oranagra (Member)

> If both AOF and RDB are enabled and a replica (or a master, in the case of the new implementation) restarts, should it ignore the replication metadata because it loads the AOF and not the RDB? That's what I've noticed. Maybe a separate issue?

Maybe when redis saves an RDB file during shutdown it should delete the AOF? and when it starts, copy the RDB file to an AOF (serving as preamble AOF).
This would not only solve the partial sync problem, but also make a much faster load time on startup (loading AOF can take very long).
On the other hand, if an unexpected crash happens, it would load the AOF file on startup.

@yossigo @yoav-steinberg @soloestoy @madolson WDYT?

@oranagra oranagra moved this from To Do to In progress in 7.0 Aug 16, 2021
@soloestoy (Collaborator, Author)

PR updated, please check @oranagra

@soloestoy (Collaborator, Author)

> Maybe when redis saves an RDB file during shutdown it should delete the AOF? and when it starts, copy the RDB file to an AOF (serving as preamble AOF).
> This would not only solve the partial sync problem, but also make a much faster load time on startup (loading AOF can take very long).
> On the other hand, if an unexpected crash happens, it would load the AOF file on startup.

I don't think it's a good time to do it, since the multi-part AOF may change a lot.

@oranagra oranagra moved this from In progress to In Review in 7.0 Aug 30, 2021
@yoav-steinberg (Contributor)

> Maybe when redis saves an RDB file during shutdown it should delete the AOF? and when it starts, copy the RDB file to an AOF (serving as preamble AOF).
> This would not only solve the partial sync problem, but also make a much faster load time on startup (loading AOF can take very long).
> On the other hand, if an unexpected crash happens, it would load the AOF file on startup.

The idea sounds good. But it raises another question: if we use RDB preambles in our AOF and RDB is also enabled, then each RDB save can be treated as an AOF rewrite. The RDB save during shutdown is just a simple case of an AOF rewrite without buffering.

@oranagra (Member) commented Sep 5, 2021

> Maybe when redis saves an RDB file during shutdown it should delete the AOF? and when it starts, copy the RDB file to an AOF (serving as preamble AOF).
> This would not only solve the partial sync problem, but also make a much faster load time on startup (loading AOF can take very long).
> On the other hand, if an unexpected crash happens, it would load the AOF file on startup.

I discussed it with Yossi, and realized that just deleting the AOF on shutdown is not good, since on startup, if we start without an AOF file, we'll have to do an initial rewrite or rename the RDB file so that we can accumulate writes into it.
On the other hand, renaming the RDB file we generate on shutdown to an AOF isn't that simple either, because when we load it on startup we won't look at its replid+offset for doing a PSYNC (this PR does that only for rdb files), and it also won't have an aof-preamble aux field.

bottom line, i agree we want to merge this PR as is, and re-open this discussion after multi-part AOF is implemented (or when working on it).
The things we'll want to consider then are:

  • faster startups when AOF and RDB are both enabled (no need to parse a huge AOF file)
  • ability to do a successful PSYNC in such a case
  • in a world of preamble AOF and multipart AOF, maybe there's no need to ever have them both enabled at the same time; each bgsave / persistent snapshot can also serve as an AOFRW (no AOF buffering in RAM)

@oranagra added the release-notes and state:to-be-merged labels Sep 5, 2021
@soloestoy (Collaborator, Author) commented Sep 7, 2021

Maybe an easy approach can work: when aof-use-rdb-preamble is yes, put the commands in the AOF into the replication backlog after loading the replication info from the preamble RDB.

@oranagra (Member) commented Sep 7, 2021

> Maybe an easy approach can work: when aof-use-rdb-preamble is yes, put the commands in the AOF into the replication backlog after loading the replication info from the preamble RDB.

I don't like it. I'm mainly aiming for restart after a graceful shutdown; in that case we can store some metadata in the tail of the aof, but I'd also like to avoid loading a huge aof file on startup.
P.s. We do flush the replica buffers before shutting down, but if that's not enough, then in addition to loading data from the aof into the backlog, we should also store the backlog in the rdb for restart.

Anyway, since there's no trivial solution I agree we should revisit that after the multipart aof change.

@yoav-steinberg (Contributor)

> • in a world of preamble AOF and multipart AOF, maybe there's no need to ever have them both enabled at the same time; each bgsave / persistent snapshot can also serve as an AOFRW (no AOF buffering in RAM)

^ 👍

@soloestoy soloestoy merged commit 794442b into redis:unstable Sep 13, 2021
@sundb (Collaborator) commented Sep 18, 2021

Related #9513.
Lately, the diskless replication read pipe cleanup test has been failing every day: https://github.com/redis/redis/runs/3593238398?check_suite_focus=true
After reverting this PR, I ran the test with --loop and it no longer failed.

@oranagra (Member)

@sundb this should be fixed by #9513
Seems to be an unrelated bug in the test that probably got exposed by some additional log print or timing change.

Successfully merging this pull request may close these issues.

Redis-cluster: After restart master of cluster,it just can get a full Asynchronous Replication