
cluster can't failover when multiple nodes fail. #6871

Open
surfingPro opened this issue Feb 8, 2020 · 1 comment
Comments


surfingPro commented Feb 8, 2020

I have 3 machines. Each machine hosts 20 masters and 20 slaves, so the cluster has 60 masters and 60 slaves in total.
The cluster can't fail over when I shut down one machine (20 masters and 20 slaves down), even though more than half of the nodes are still alive.
The cluster also can't fail over when I stop 15 masters and 15 slaves.
Failover succeeds when I reduce the number of stopped instances to 8 masters and 8 slaves.
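
For reference, here is a minimal way to watch whether failover ever completes, as seen from the surviving machines (a rough sketch only; it assumes redis-cli is on the PATH, and the hosts/ports are guesses based on the layout above):

import subprocess
import time

# Hypothetical surviving nodes; substitute real hosts/ports from the cluster.
SURVIVING_NODES = [("10.129.104.7", 6379), ("10.129.104.8", 6379)]

def cluster_state(host, port):
    # Run `redis-cli -h HOST -p PORT cluster info` and parse the key:value lines.
    out = subprocess.run(
        ["redis-cli", "-h", host, "-p", str(port), "cluster", "info"],
        capture_output=True, text=True, timeout=5,
    ).stdout
    fields = dict(line.split(":", 1) for line in out.splitlines() if ":" in line)
    return fields.get("cluster_state"), fields.get("cluster_slots_fail")

while True:  # stop with Ctrl+C
    for host, port in SURVIVING_NODES:
        state, slots_fail = cluster_state(host, port)
        print(f"{host}:{port} cluster_state={state} cluster_slots_fail={slots_fail}")
    time.sleep(5)

If cluster_slots_fail never drops to 0 and cluster_state stays "fail" on the surviving nodes, the replicas are not winning their failover elections.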

Case of 15 masters and 15 slaves down:
The 3 machines are 10.129.104.6, 10.129.104.7, and 10.129.104.8. The 15 masters and 15 slaves on 10.129.104.6 are stopped.
Strangely, instances on the other machines also become unavailable.
[screenshot]
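
For completeness, one way to script this test case (a sketch only; the port range is hypothetical, since the report only shows port 6379, so substitute the real ports of the instances on 10.129.104.6):

import subprocess

HOST = "10.129.104.6"
# Hypothetical ports: 30 of the 40 instances on this machine (15 masters + 15 slaves).
PORTS = range(6379, 6379 + 30)

for port in PORTS:
    # SHUTDOWN NOSAVE stops the instance immediately, without writing an RDB file.
    subprocess.run(["redis-cli", "-h", HOST, "-p", str(port), "shutdown", "nosave"])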

cluster info:
[screenshot]

redis.log:
[screenshots]

Some slots are still not OK after 30 minutes:
[screenshot]

redis.conf
daemonize no
protected-mode no
bind 10.129.104.6
dir /usr/local/redis-cluster5.0.3/data/redis-6379
pidfile /var/run/redis-cluster5.0.3/redis-6379.pid
logfile /usr/local/redis-cluster5.0.3/log/redis-6379.log
port 6379
cluster-enabled yes
cluster-config-file /usr/local/redis-cluster5.0.3/conf/node-6379.conf
cluster-node-timeout 30000
cluster-require-full-coverage no
appendonly yes
maxmemory 8gb
maxmemory-policy volatile-lru
cluster-slave-validity-factor 0
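
Two of these settings directly affect failover timing: cluster-node-timeout 30000 means a master is only flagged as failing after 30 seconds, and cluster-slave-validity-factor 0 disables the data-age check, so every replica of a failed master considers itself a valid candidate. A small sketch of the eligibility rule documented in redis.conf (the default repl-ping-replica-period of 10 s is assumed):

def replica_may_failover(disconnect_seconds, node_timeout=30, validity_factor=0,
                         repl_ping_period=10):
    # Per the redis.conf documentation: a replica skips failover if its link to
    # the master was down for longer than
    #   (node-timeout * cluster-slave-validity-factor) + repl-ping-replica-period,
    # and a factor of 0 disables the check entirely.
    if validity_factor == 0:
        return True
    max_age = node_timeout * validity_factor + repl_ping_period
    return disconnect_seconds <= max_age

print(replica_may_failover(120))                      # True: factor 0, always eligible
print(replica_may_failover(120, validity_factor=10))  # True: 120 <= 30*10 + 10
print(replica_may_failover(400, validity_factor=10))  # False: 400 > 310

So with this configuration the stuck failover shouldn't be caused by replicas disqualifying themselves over stale data.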

zuiderkwast (Contributor) commented:

It's pretty bad if a cluster can't survive the crash of a machine hosting a fraction of the nodes. Too bad we don't have a test case simulating this.

I'm guessing:

  • Many masters fail at the same time, so many replicas try to get elected for failover, all bump their epoch at the same time, and end up with conflicting epochs? (See the toy sketch below.)
  • Or do even more nodes fail because they can't communicate with the other failed nodes?
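
A toy model of the first guess (my sketch, loosely following the Cluster spec: a replica bumps currentEpoch when it starts an election, and a master grants at most one vote per epoch), showing why simultaneous elections with colliding epochs would serialize the failovers:

import random

ALIVE_MASTERS = 45                   # 60 masters minus the 15 that were stopped
QUORUM = 60 // 2 + 1                 # a replica needs a majority of all masters
REPLICAS = 15                        # one replica per failed master

shared_epoch = 100                   # epoch every node agreed on before the crash
last_vote_epoch = [0] * ALIVE_MASTERS

# Because the failures are simultaneous, every replica bumps the same base
# epoch and proposes shared_epoch + 1; the spec's per-replica delay
# (500ms + rand(0..500ms) + rank * 1000ms) only spreads the requests slightly.
requests = sorted((500 + random.randint(0, 500), shared_epoch + 1)
                  for _ in range(REPLICAS))

winners = 0
for _delay_ms, epoch in requests:
    votes = sum(1 for m in range(ALIVE_MASTERS) if last_vote_epoch[m] < epoch)
    for m in range(ALIVE_MASTERS):
        if last_vote_epoch[m] < epoch:
            last_vote_epoch[m] = epoch   # one vote per epoch, first come first served
    if votes >= QUORUM:
        winners += 1

print(f"{winners} of {REPLICAS} replicas win the first round; the rest must retry")

If that is roughly what happens, the losing replicas have to retry in later rounds, and with cluster-node-timeout 30000 each retry is spaced by multiples of the 30-second node timeout, which could look like a failover that never completes.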

@madolson Have you seen this problem? Any idea why this happens?
