
[CRASH] Can't replicate data from master to slave #8614

Open
guguri12 opened this issue Mar 7, 2021 · 12 comments
Comments


guguri12 commented Mar 7, 2021

Crash report


My Redis cluster contains 5 masters and 5 slaves. I found that data could not be replicated from a master to its slave, and in one pair only, the slave was promoted to master. Please see the log from the master node:
-------------------------------------------------------------------------------------------------
34111:M 07 Mar 2021 18:50:00.166 * Background saving terminated with success
34111:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
34111:M 07 Mar 2021 18:50:44.889 # Cluster state changed: fail
34111:M 07 Mar 2021 18:50:45.968 # Failover auth granted to ef52c82a131c1ab206379d18d6337adc597ad482 for epoch 11
34111:M 07 Mar 2021 18:50:45.986 # Cluster state changed: ok
34111:M 07 Mar 2021 18:51:01.010 * 10000 changes in 60 seconds. Saving...
34111:M 07 Mar 2021 18:51:01.444 * Background saving started by pid 26351
34111:M 07 Mar 2021 18:55:13.811 * Clear FAIL state for node d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179: master without slots is reachable again.
34111:M 07 Mar 2021 18:55:13.818 # Configuration change detected. Reconfiguring myself as a replica of 122ceb58645dd2b0952abed790122c818ff30f75
34111:S 07 Mar 2021 18:55:13.820 # Connection with replica 10.1.xxx.xxx:8000 lost.
34111:S 07 Mar 2021 18:55:13.820 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
34111:S 07 Mar 2021 18:55:14.475 * Connecting to MASTER 10.1.xxx.xxx:8000
34111:S 07 Mar 2021 18:55:14.475 * MASTER <-> REPLICA sync started
34111:S 07 Mar 2021 18:55:14.475 * Non blocking connect for SYNC fired the event.
34111:S 07 Mar 2021 18:55:14.475 * Master replied to PING, replication can continue...
34111:S 07 Mar 2021 18:55:14.476 * Trying a partial resynchronization (request 2ee7f1c1389150dc756f14ca1e3548b3d17d2427:138317955027).
34111:S 07 Mar 2021 18:55:14.876 * Full resync from master: 6759d702213c0d5ef1bf1733b7b1e21d7ff8d7fe:138319544125
34111:S 07 Mar 2021 18:55:14.876 * Discarding previously cached master state.
26351:C 07 Mar 2021 18:57:49.593 * DB saved on disk
26351:C 07 Mar 2021 18:57:50.025 * RDB: 202 MB of memory used by copy-on-write
34111:S 07 Mar 2021 18:57:50.655 * Background saving terminated with success
34111:S 07 Mar 2021 18:58:47.025 * MASTER <-> REPLICA sync: receiving 20036023309 bytes from master
34111:S 07 Mar 2021 19:01:06.577 * FAIL message received from 122ceb58645dd2b0952abed790122c818ff30f75 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
34111:S 07 Mar 2021 19:02:51.054 * 10 changes in 300 seconds. Saving...
34111:S 07 Mar 2021 19:02:51.455 * Background saving started by pid 26656
34111:S 07 Mar 2021 19:02:53.322 * Clear FAIL state for node d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179: replica is reachable again.
34111:S 07 Mar 2021 19:05:04.236 * Replica is about to load the RDB file received from the master, but there is a pending RDB child running. Killing process 26656 and removing its temp file to avoid any race
34111:S 07 Mar 2021 19:05:05.572 * MASTER <-> REPLICA sync: Flushing old data
34111:S 07 Mar 2021 19:07:09.554 * MASTER <-> REPLICA sync: Loading DB in memory
34111:S 07 Mar 2021 19:09:43.051 * MASTER <-> REPLICA sync: Finished with success
34111:S 07 Mar 2021 19:09:43.065 # Background saving terminated by signal 10

Additional information

  1. OS distribution and version
    Red Hat 8.3 on a VM and Redis v5.0.5
  2. Steps to reproduce (if any)
    Can't reproduce.
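For context, a minimal sketch of commands that can be used to inspect the cluster when replication appears stuck; the host is a placeholder (masked like the IPs above) and port 8000 is taken from the log:

redis-cli -h 10.1.x.x -p 8000 cluster info    # overall health: cluster_state, cluster_slots_ok, cluster_known_nodes
redis-cli -h 10.1.x.x -p 8000 cluster nodes   # per-node ID, address, flags (master/slave/fail) and slot ranges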
guguri12 changed the title from [CRASH] to [CRASH] Can't replicate data from master to slave on Mar 7, 2021
Member

yossigo commented Mar 8, 2021

@guguri12 I don't see a crash in the logs, just a master node being demoted to replica and performing a full sync from its new master. Please provide more information if you believe there's an issue here.
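A quick way to confirm the node's current role and sync state from the node itself (host is a placeholder; port 8000 comes from the logs):

redis-cli -h 10.1.x.x -p 8000 role                # reports master or slave, plus the replication offset
redis-cli -h 10.1.x.x -p 8000 info replication    # role, master_host, master_link_status, master_sync_in_progress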

Author

guguri12 commented Mar 9, 2021

@yossigo Please see the lines below. Why did the master node fail over to the replica node? And why couldn't the master node connect to the replica node? I can't find the root cause.

34111:M 07 Mar 2021 18:50:00.166 * Background saving terminated with success
34111:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
34111:M 07 Mar 2021 18:50:44.889 # Cluster state changed: fail
34111:M 07 Mar 2021 18:50:45.968 # Failover auth granted to ef52c82a131c1ab206379d18d6337adc597ad482 for epoch 11
34111:M 07 Mar 2021 18:50:45.986 # Cluster state changed: ok
34111:M 07 Mar 2021 18:51:01.010 * 10000 changes in 60 seconds. Saving...
34111:M 07 Mar 2021 18:51:01.444 * Background saving started by pid 26351
34111:M 07 Mar 2021 18:55:13.811 * Clear FAIL state for node d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179: master without slots is reachable again.
34111:M 07 Mar 2021 18:55:13.818 # Configuration change detected. Reconfiguring myself as a replica of 122ceb58645dd2b0952abed790122c818ff30f75
34111:S 07 Mar 2021 18:55:13.820 # Connection with replica 10.1.xxx.xxx:8000 lost.
34111:S 07 Mar 2021 18:55:13.820 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
34111:S 07 Mar 2021 18:55:14.475 * Connecting to MASTER 10.1.xxx.xxx:8000

Member

yossigo commented Mar 9, 2021

@guguri12 It was instructed by another node to fail over, after a majority of the nodes determined that the node was indeed failing. To see the full picture, you'll need to look at the logs of the other nodes as well.
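A rough sketch of how to line up the failover-related events across nodes; the log path is an assumption, so use wherever each instance's logfile directive points:

# The filename prefix shows which node each line came from; sorting on field 5 orders by
# time of day (good enough when everything happened on the same date, as in these logs).
grep -E -i 'FAIL message received|Marking node|Failover|Cluster state changed' /var/log/redis/*.log | sort -k5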

Author

guguri12 commented Mar 9, 2021

@yossigo Do you mean to look at the log file from the replica node? On the replica node, the log shows the following:

8674:S 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
8674:S 07 Mar 2021 18:50:44.889 # Cluster state changed: fail
8674:S 07 Mar 2021 18:50:45.987 # Cluster state changed: ok
30728:C 07 Mar 2021 18:51:26.740 * DB saved on disk
30728:C 07 Mar 2021 18:51:27.198 * RDB: 751 MB of memory used by copy-on-write
8674:S 07 Mar 2021 18:51:27.800 * Background saving terminated with success
8674:S 07 Mar 2021 18:51:28.035 * FAIL message received from b991ee84cfff581481bf630437bea929255681ce about 411331e5641bcbd7b4906e36e8e1783a50064c24
8674:S 07 Mar 2021 18:51:28.035 # Cluster state changed: fail
8674:S 07 Mar 2021 18:51:28.102 # Start of election delayed for 788 milliseconds (rank #0, offset 138317741254).
8674:S 07 Mar 2021 18:51:28.906 # Starting a failover election for epoch 12.
8674:S 07 Mar 2021 18:51:28.911 # Failover election won: I'm the new master.
8674:S 07 Mar 2021 18:51:28.911 # configEpoch set to 12 after successful failover
8674:M 07 Mar 2021 18:51:28.911 # Setting secondary replication ID to 2ee7f1c1389150dc756f14ca1e3548b3d17d2427, valid up to offset: 138317741255. New replication ID is 6759d702213c0d5ef1bf1733b7b1e21d7ff8d7fe
8674:M 07 Mar 2021 18:51:28.911 # Connection with master lost.

Member

yossigo commented Mar 9, 2021

@guguri12 No, the decision to initiate failover is reached by a majority of master nodes. You need to look at the logs of all master nodes to try to understand the sequence of events.
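To identify the current master instances (and where each one writes its log), something like this works against any reachable node; the host is a placeholder and port 8000 is from the logs:

redis-cli -h 10.1.x.x -p 8000 cluster nodes | grep master    # lines whose flags include "master" are the current masters
redis-cli -h 10.1.x.x -p 8000 config get logfile             # run per instance to find where its log is written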

Author

guguri12 commented Mar 9, 2021

@yossigo Do you mean the log file of the master node in the same shard, or of the other master nodes?

Member

yossigo commented Mar 9, 2021

@guguri12 All master instances.

Author

guguri12 commented Mar 9, 2021

@yossigo When the Redis client connects to the Redis cluster, should I apply the configuration to all instances (master and slave)?

Author

guguri12 commented Mar 9, 2021

@yossigo I have attached the log files from all master instances. After reviewing them again, I think there are 2 master instances that hit this issue.

Member

yossigo commented Mar 10, 2021

@guguri12 Configuration is done locally on a per-instance basis, so yes, if you perform CONFIG SET you need to apply it to all relevant Redis instances.
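For illustration, a minimal sketch of pushing one setting to every instance; the endpoint list and the parameter are placeholders, not values from this cluster:

# Hypothetical endpoints; replace with the real host:port of each master and replica.
for node in 10.1.0.1:8000 10.1.0.2:8000 10.1.0.3:8000; do
  host=${node%:*}; port=${node#*:}
  redis-cli -h "$host" -p "$port" config set repl-timeout 60   # example parameter only
  redis-cli -h "$host" -p "$port" config rewrite               # persist the change to redis.conf
done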

There are no logs attached, have you been able to trace them and get to a resolution? Or do you still suspect there is an issue with Redis here?

@guguri12
Author

master_node_logs.txt
@yossigo I didn't run the CONFIG SET command at that time, so I think there is something wrong in the Redis server. Please kindly help review the Redis log files from the master instances.

Member

yossigo commented Mar 11, 2021

@guguri12 It looks like you experienced a temporary node/network failure, which resulted in a quorum decision to fail over the node. Later on, the failed-over node was communicating with the other nodes again and re-synced with the new master as a replica.

This is where it all started:

34111:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
33609:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
32484:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
33018:M 07 Mar 2021 18:50:44.883 * Marking node d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179 as failing (quorum reached).
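Not a fix, just a hedged pointer: a node that stays unreachable for longer than cluster-node-timeout is flagged PFAIL and then, once a majority of masters agrees, FAIL, which is the sequence shown above. Checking (and, if brief network blips are expected, raising) that value on every node is a common mitigation; the host is a placeholder and port 8000 is from the logs:

redis-cli -h 10.1.x.x -p 8000 config get cluster-node-timeout   # milliseconds a node may be unreachable before it is considered failing
# A larger value makes the cluster slower to fail over during short network hiccups; if changed,
# apply it to every node (see the loop sketched earlier) and run config rewrite to persist it.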
