Full resync from manual failover unexpected #4819

Open

ryan-shaw opened this issue Apr 5, 2018 · 14 comments

@ryan-shaw

ryan-shaw commented Apr 5, 2018

We needed to manually fail over the Redis master to upgrade the host hardware. We ran SENTINEL FAILOVER <master> to trigger the failover and expected the slaves to reconfigure to the new master and perform a partial resync. However, both slaves performed a full resync. This is what was in the master's log:

Partial resynchronization not accepted: Requested offset for second ID was 382879226480, but I can reply up to 382879184738

From what I understand this means that the slave is actually ahead of the new master, so the slave clears its data and resyncs from the new master, which means there is data loss, correct?

redis 4.0.6

@soloestoy
Collaborator

Is this the original master's log, or the log of the slave that was promoted to master after the failover?

@ryan-shaw
Author

ryan-shaw commented Apr 7, 2018 via email

@soloestoy
Collaborator

Partial resynchronization not accepted: Requested offset for second ID was 382879226480, but I can reply up to 382879184738

From the log you gave, I believe it's not the original master's log; it is from the slave that was promoted to master after the failover, as I said before.

Since 4.0, Redis has a new synchronization mechanism, PSYNC2, which means that after a failover the slaves have a chance to perform a partial resync. But there are some requirements, and one of them is that a slave's offset must not be ahead of the new master's. The new master cannot guarantee that, because it was a slave before the failover and may have lagged behind the other slaves in replicating from the old master.

Just see the code below (from masterTryPartialResynchronization() in replication.c):

    /* Is the replication ID of this master the same advertised by the wannabe
     * slave via PSYNC? If the replication ID changed this master has a
     * different replication history, and there is no way to continue.
     *
     * Note that there are two potentially valid replication IDs: the ID1
     * and the ID2. The ID2 however is only valid up to a specific offset. */
    if (strcasecmp(master_replid, server.replid) &&
        (strcasecmp(master_replid, server.replid2) ||
         psync_offset > server.second_replid_offset))
    {
        /* Run id "?" is used by slaves that want to force a full resync. */
        if (master_replid[0] != '?') {
            if (strcasecmp(master_replid, server.replid) &&
                strcasecmp(master_replid, server.replid2))
            {
                serverLog(LL_NOTICE,"Partial resynchronization not accepted: "
                    "Replication ID mismatch (Slave asked for '%s', my "
                    "replication IDs are '%s' and '%s')",
                    master_replid, server.replid, server.replid2);
            } else {
                serverLog(LL_NOTICE,"Partial resynchronization not accepted: "
                    "Requested offset for second ID was %lld, but I can reply "
                    "up to %lld", psync_offset, server.second_replid_offset);
            }
        } else {
            serverLog(LL_NOTICE,"Full resync requested by slave %s",
                replicationGetSlaveName(c));
        }
        goto need_full_resync;
    }

Now I think you can see why the slaves could not perform a partial resync ^_^
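
For illustration only, here is a minimal sketch (assuming the redis-py Python client and a placeholder host name) that prints the fields this C code compares, as exposed by INFO replication:

    import redis

    def show_psync_state(host, port=6379):
        """Print the replication IDs and offsets involved in the PSYNC check above."""
        info = redis.Redis(host=host, port=port).info("replication")
        # master_replid / master_replid2 correspond to server.replid / server.replid2;
        # second_repl_offset is the offset up to which the second ID is still valid.
        for field in ("role", "master_replid", "master_replid2",
                      "master_repl_offset", "second_repl_offset"):
            print(f"{field}: {info.get(field)}")

    show_psync_state("promoted-master.example")  # placeholder host

Comparing the promoted master's second_repl_offset with another node's master_repl_offset reproduces the two numbers shown in the "Requested offset for second ID" log line.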

@soloestoy
Collaborator

And for a manual failover, I think the first thing that should happen is that the master refuses all writes and waits until the slave we choose to promote to master has the same offset as the master.
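
As a rough operator-side approximation of that idea (a sketch only, assuming the redis-py client and placeholder host names, not anything Sentinel does today):

    import time
    import redis

    master = redis.Redis(host="old-master.example", port=6379)       # placeholder
    candidate = redis.Redis(host="chosen-slave.example", port=6379)  # placeholder

    # With writes to the master stopped, poll until the candidate slave has
    # consumed everything the master has put into its replication stream.
    while True:
        m_off = master.info("replication")["master_repl_offset"]
        s_off = candidate.info("replication")["slave_repl_offset"]
        if s_off >= m_off:
            break
        time.sleep(0.1)
    # Only now is it safe to start the failover without losing acknowledged writes.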

@ryan-shaw
Author

@soloestoy yep, that is what I think needs to happen too. Data loss is inevitable with a failure and automatic failover, but with a manual failover I don't believe data loss is acceptable.

@celesteking

Guys, this is total bullshit. What's the sense in setting up sentinel infrastructure if it can't handle a simple task of failing over the host properly?

I've crafted a simple test setup populated with a gig of keys, then triggered a manual failover. Exactly as described here, the new master, which was a slave before, turned itself into master only after a full resync. In my case, there were absolutely no writes to the entire infrastructure -- everything was idle, there weren't even any reads. Despite that, it requested a full sync.

Now, if that's the expected outcome, the sentinel doc needs to be updated with a warning that if you are planning to fail over the hosts manually (and automatically too, I guess, as I see absolutely no difference), the switchover will require a full resync, which with gigabytes of data will take A LOT of time. During that period, there will be data loss.

Secondly, there's something wonky with the sentinel +switch-master messages. They come from different sentinels at different times. This is unacceptable; sentinels should act together and react together, and there should be no discrepancies.

# on leader sentinel
88:X 21 Nov 18:01:54.848 # Executing user requested FAILOVER of 'mred1'
88:X 21 Nov 18:01:54.848 # +new-epoch 134
88:X 21 Nov 18:01:54.848 # +try-failover master mred1 192.168.112.4 6379
88:X 21 Nov 18:01:54.857 # +vote-for-leader 7c7284fcdb584bf640c6a420ce8ed63a030160c4 134
88:X 21 Nov 18:01:54.857 # +elected-leader master mred1 192.168.112.4 6379
88:X 21 Nov 18:01:54.857 # +failover-state-select-slave master mred1 192.168.112.4 6379
88:X 21 Nov 18:01:54.940 # +selected-slave slave 192.168.112.2:6379 192.168.112.2 6379 @ mred1 192.168.112.4 6379
88:X 21 Nov 18:01:54.940 * +failover-state-send-slaveof-noone slave 192.168.112.2:6379 192.168.112.2 6379 @ mred1 192.168.112.4 6379
88:X 21 Nov 18:01:55.011 * +failover-state-wait-promotion slave 192.168.112.2:6379 192.168.112.2 6379 @ mred1 192.168.112.4 6379
88:X 21 Nov 18:01:55.641 # +promoted-slave slave 192.168.112.2:6379 192.168.112.2 6379 @ mred1 192.168.112.4 6379
88:X 21 Nov 18:01:55.641 # +failover-state-reconf-slaves master mred1 192.168.112.4 6379
88:X 21 Nov 18:01:55.728 * +slave-reconf-sent slave 192.168.112.3:6379 192.168.112.3 6379 @ mred1 192.168.112.4 6379
88:X 21 Nov 18:01:56.675 * +slave-reconf-inprog slave 192.168.112.3:6379 192.168.112.3 6379 @ mred1 192.168.112.4 6379
88:X 21 Nov 18:02:04.010 * +slave-reconf-done slave 192.168.112.3:6379 192.168.112.3 6379 @ mred1 192.168.112.4 6379
88:X 21 Nov 18:02:04.093 # +failover-end master mred1 192.168.112.4 6379
88:X 21 Nov 18:02:04.093 # +switch-master mred1 192.168.112.4 6379 192.168.112.2 6379
88:X 21 Nov 18:02:04.093 * +slave slave 192.168.112.3:6379 192.168.112.3 6379 @ mred1 192.168.112.2 6379
88:X 21 Nov 18:02:04.093 * +slave slave 192.168.112.4:6379 192.168.112.4 6379 @ mred1 192.168.112.2 6379

# on other sentinel
13:X 21 Nov 18:01:55.192 # +new-epoch 134
13:X 21 Nov 18:01:55.728 # +config-update-from sentinel 7c7284fcdb584bf640c6a420ce8ed63a030160c4 192.168.112.2 26379 @ mred1 192.168.112.4 6379
13:X 21 Nov 18:01:55.728 # +switch-master mred1 192.168.112.4 6379 192.168.112.2 6379
13:X 21 Nov 18:01:55.728 * +slave slave 192.168.112.3:6379 192.168.112.3 6379 @ mred1 192.168.112.2 6379
13:X 21 Nov 18:01:55.728 * +slave slave 192.168.112.4:6379 192.168.112.4 6379 @ mred1 192.168.112.2 6379
13:X 21 Nov 18:02:05.826 * +convert-to-slave slave 192.168.112.4:6379 192.168.112.4 6379 @ mred1 192.168.112.2 6379
# 3rd sentinel purposely down

This is just the tip of the iceberg. This is no high availability, this is bullshit, guys.

@ryan-shaw
Author

ryan-shaw commented Nov 22, 2018

@celesteking that doesn't sound like what I'm trying to explain in my issue. My issue involves a write workload: when a manual failover is requested, some of the master's writes don't get copied to the slave even though the master is still available, therefore losing some keys; no full sync occurred.

I'm not sure you are describing the situation accurately; could you post all the logs from during the failover? A new master wouldn't perform a full resync, as it's the master and has no master to sync from. A slave could request a full or partial resync during a failover; a partial sync can only happen if the master has kept enough data in its replication backlog.

Also, I believe the sentinel messages you see come from the propagation of the information about which node is the new master.

@sczizzo

sczizzo commented Mar 7, 2019

@celesteking I ran into that recently myself. I definitely share your frustration.

I think the unfortunate answer is it works that way by design.

From https://redis.io/topics/sentinel (emphasis mine):

In general Redis + Sentinel as a whole are an eventually consistent system where the merge function is last failover wins, and the data from old masters are discarded to replicate the data of the current master, so there is always a window for losing acknowledged writes. This is due to Redis asynchronous replication and the discarding nature of the "virtual" merge function of the system. Note that this is not a limitation of Sentinel itself, and if you orchestrate the failover with a strongly consistent replicated state machine, the same properties will still apply. There are only two ways to avoid losing acknowledged writes:

  1. Use synchronous replication (and a proper consensus algorithm to run a replicated state machine).
  2. Use an eventually consistent system where different versions of the same object can be merged.

Redis currently is not able to use any of the above systems, and is currently outside the development goals. However there are proxies implementing solution "2" on top of Redis stores such as SoundCloud Roshi, or Netflix Dynomite.

Some historical discussion:

@jbmassicotte

I may not agree with @celesteking's choice of words, but I agree with his assessment: the manual failover does not work properly, i.e., it always results in a full resync, even though no writes are happening.

In spite of the various statements and attempts at explanation from others in this thread, I refuse to believe that 1) this is acceptable, and 2) this is the expected behavior.

We are trying to move our 100M data records from a large RDBMS to a more nimble solution using Redis. All prototyping and experimentation with Redis has been very positive up to now, except for this full resync situation, which results in our 20-node system being partially out of service for almost 30 minutes whenever we need to perform maintenance on the master node.

Please somebody tell me I am doing something wrong, e.g., my config is messed up, or I am using a buggy version of Redis (using 4.0.12)!

Last note on this topic: the 'automatic failover' (i.e., I stop the master and let the sentinels detect the failed master) seems to reliably result in a partial resync, which is what you expect. It's unclear why this behavior would be different from the manual failover. So yes, I may opt to stop the master node whenever maintenance is required, rather than using the manual failover.

@sczizzo

sczizzo commented Mar 8, 2019

During a manual failover with Redis+Sentinel, there's a short window of time where one of the old slaves has been elected the new master, but the other old slaves and the old master haven't witnessed the election, so they continue to accept writes. The new master never sees those writes, though, so when the new master is finally recognized, it denies the partial sync from the old master/slaves (their replication offset is too high).

During an "automatic" failover, the old master isn't around, so no writes are accepted during failover, and you don't run into this situation.

One potential solution for the manual failover scenario: temporarily CONFIG SET min-slaves-to-write $NUM_SLAVES on the current master, trigger the failover, and restore the old setting when done. This way, no writes can proceed during the failover.
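
A rough sketch of that workaround (assuming the redis-py client; the addresses, replica count, and the master name 'mred1' from the logs above are only illustrative, and min-slaves-max-lag must be non-zero as well, which it is by default):

    import redis

    master = redis.Redis(host="192.168.112.4", port=6379)     # current master (placeholder)
    sentinel = redis.Redis(host="192.168.112.2", port=26379)  # any sentinel (placeholder)
    NUM_SLAVES = 2  # number of replicas in this setup (assumption)

    old = master.config_get("min-slaves-to-write")["min-slaves-to-write"]
    # Require all replicas to be present for writes; once the promoted slave detaches,
    # the old master falls below this count and starts rejecting writes.
    master.config_set("min-slaves-to-write", NUM_SLAVES)

    sentinel.execute_command("SENTINEL", "FAILOVER", "mred1")

    # ... wait for +switch-master, then restore the previous setting on the old master
    master.config_set("min-slaves-to-write", old)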

@antirez
Contributor

antirez commented Mar 8, 2019

Hello, I think we should change Sentinel manual failover orchestration to be like Redis Cluster. In Redis Cluster all this works as expected because the master will stop allowing clients to write, will make sure that the salves received every write, and then will continue with the failover process. As @soloestoy said we should have in Sentinel the same exact operations regarding manual failover. Right now Sentinel manual failover is instead completely non-orchestrated. What will happen is just that the failure detection is not needed for the server to failover, and that's it. Btw AFAIK it is simple even before the feature is implemented to orchestrate a manual failover by using CLIENT PAUSE, sending CLIENT PAUSE to the master, then waiting for the offsets to settle (in a more crude way this can just be, waiting a few seconds), and finally calling the SENTINEL FAILOVER command. I wonder if @soloestoy thinks likewise here.

@jbmassicotte

Well look at that! Using the CLIENT PAUSE command as suggested by antirez clearly helps (thank you!).

What puzzles me though is that there are no writes occurring during my tests. No writes, no updates, zip. So why does one need to suspend the clients? Is there some kind of background traffic, not related to updates, that causes the slaves' offsets to increment?

@sabretus

Quick question @antirez
There is also the WAIT command, which is supposed to do something similar to CLIENT PAUSE; is that correct?

@ullumullu

ullumullu commented Oct 25, 2019

Just to confirm: is a full resync expected to happen on an idle / fresh Redis Sentinel setup? I always see a full resync on a manual sentinel failover, no matter whether I do a CLIENT PAUSE on the leader or not. Running Redis 5.0.5 here.

Some details: I guess a full resync happens because the sentinels can't demote the old leader during the client pause. Afterwards, the offset counter increases for whatever reason when the sentinels start to monitor the old leader again, which raises its offset above the new leader's.

It's also unclear to me why the sentinels' health checks increase the replication offset.
