
Cluster Mode: master fails over to slave; when the slave fails over again, the Redisson client retains a connection to the previous dead node #1481

Closed
ericwu1 opened this issue Jun 11, 2018 · 4 comments
Labels: bug

Comments


@ericwu1 ericwu1 commented Jun 11, 2018

Expected behavior

Redisson is able to send read and write commands to the 2nd newly elected Master.

Actual behavior

This is the topology:
    71a80a4554db9893e194545a45e4caeaf159a726 10.45.248.93:6379 master - 0 1527887027479 76 connected 0-5460
    01f30ae68e733f9b6394efd05577c4d1d6ab011a 10.45.247.71:6379 slave 22b3f1ec6e436d206c2757a9fbf97b325d753486 0 1527887028983 93 connected
    22b3f1ec6e436d206c2757a9fbf97b325d753486 10.45.247.72:6379 master - 0 1527887027981 80 connected 10922-16383
    dfa03dc3fdcb917641a8d4e85e8373ee19e1fda8 10.45.248.80:6379 master - 0 1527887028482 93 connected 5461-10921
    b7695f9238909af61f158deafe7f4f33ffd2eae4 10.45.248.134:6379 slave dfa03dc3fdcb917641a8d4e85e8373ee19e1fda8 0 1527887028983 93 connected
    a6a42df404d5adac79203eb9bc3fc49bcc03fd19 10.45.248.106:6379 myself,slave 71a80a4554db9893e194545a45e4caeaf159a726 0 0 61 connected
    5c421bf7466e8702bbf344de9ae528fa8fbea1d7 10.45.247.74:6379 slave dfa03dc3fdcb917641a8d4e85e8373ee19e1fda8 0 1527887029482 93 connected
    9640188f1b2214faa6c26a1755179803ebafad7c 10.45.247.75:6379 slave 71a80a4554db9893e194545a45e4caeaf159a726 0 1527887027981 76 connected
When testing various failover situations, I ran into one that causes the Redisson client to get stuck using a bad connection for write calls, even though the thread polling the topology is fine. The sequence is: the redis process dies on a master node I am writing to, and a slave is promoted to master; the Redisson client is still in a GOOD state. The redis process then dies on the new master, and a floating slave is promoted to master. At this point the Redisson client is STUCK using a bad connection to the dead master.
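
To illustrate the split described here (topology polling keeps working while the command connections are stuck), a minimal probe like the following could be used. The redisson instance, helper name, and key are illustrative only; the getClusterNodesGroup()/getNodes(NodeType.MASTER) and getBucket(...).set(...) calls are the same ones used in the test case further down:

    import java.net.InetSocketAddress;

    import org.redisson.api.ClusterNode;
    import org.redisson.api.NodeType;
    import org.redisson.api.RedissonClient;

    public class FailoverProbe {

        // Hypothetical helper: compares what the client's topology view reports
        // with whether a plain write still goes through.
        static void probe(RedissonClient redisson) {
            // Topology-polling path: this kept returning the up-to-date master set.
            for (ClusterNode node : redisson.getClusterNodesGroup().getNodes(NodeType.MASTER)) {
                InetSocketAddress addr = node.getAddr();
                System.out.println("client sees master: " + addr);
            }
            // Command path: after the second failover, writes kept failing because
            // the connection still pointed at the dead ex-master.
            try {
                redisson.getBucket("probe-key").set("value");
                System.out.println("write OK");
            } catch (Exception e) {
                System.out.println("write failed: " + e);
            }
        }
    }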

Steps to reproduce or test case

Run a 7-node cluster; I believe this is the minimum needed to reproduce it, since you need an extra floating slave to promote after the first master/slave pair dies. For the master owning slots 0-5460 (M1), kill the redis process. Its slave (S1) should be promoted to master, and the floating slave (FS1) should attach to this new master. Then kill the redis process on S1 (now a master); FS1 should be promoted to master. The Redisson client now fails reads/writes, because its connections still point to the old S1 node even though FS1 is the new master, yet it keeps polling CLUSTER NODES successfully, presumably because that happens on a separate connection pool/thread.
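
For reference, that minimal 7-node layout can be sketched with the RedisRunner/ClusterRunner test helpers that also appear in the test case below; ports and variable names here are arbitrary. The key point is that one shard carries two replicas, so a floating slave is still available after the first failover:

    // Sketch of the minimal layout: 3 masters, 4 slaves, one shard with two
    // replicas so a floating slave (FS1) remains after the first failover.
    // Ports are arbitrary; RedisRunner/ClusterRunner are Redisson's test helpers.
    RedisRunner m1 = new RedisRunner().port(7000).randomDir().nosave();  // M1
    RedisRunner m2 = new RedisRunner().port(7001).randomDir().nosave();
    RedisRunner m3 = new RedisRunner().port(7002).randomDir().nosave();
    RedisRunner s1 = new RedisRunner().port(7100).randomDir().nosave();  // S1
    RedisRunner fs1 = new RedisRunner().port(7101).randomDir().nosave(); // FS1, floating slave
    RedisRunner s2 = new RedisRunner().port(7102).randomDir().nosave();
    RedisRunner s3 = new RedisRunner().port(7103).randomDir().nosave();

    ClusterProcesses cluster = new ClusterRunner()
            .addNode(m1, s1, fs1)   // kill m1 first, then kill the promoted s1
            .addNode(m2, s2)
            .addNode(m3, s3)
            .run();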

Redis version

3.2.11

Redisson version

3.7.0

Redisson configuration

Default Redis cluster config with 7 nodes: 3 masters, 4 slaves.
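
For context, a minimal Redisson client setup against that cluster would look roughly like the following. This is an illustration with default cluster settings, not the exact config used; the seed addresses are simply taken from the CLUSTER NODES dump above:

    // Minimal sketch: default cluster settings, seeded with node addresses
    // from the topology dump above (any reachable node would do).
    Config config = new Config();
    config.useClusterServers()
          .addNodeAddress("redis://10.45.248.93:6379", "redis://10.45.247.72:6379");
    RedissonClient redisson = Redisson.create(config);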

@mrniko mrniko added this to the 2.12.3 milestone Jun 12, 2018
@mrniko mrniko added the bug label Jun 25, 2018

@mrniko mrniko commented Jun 25, 2018

Unable to reproduce it with version 3.7.2; my test case is below:

    @Test
    public void testFailoverInCluster() throws Exception {
        RedisRunner master1 = new RedisRunner().port(6890).randomDir().nosave();
        RedisRunner master2 = new RedisRunner().port(6891).randomDir().nosave();
        RedisRunner master3 = new RedisRunner().port(6892).randomDir().nosave();
        RedisRunner slave1 = new RedisRunner().port(6900).randomDir().nosave();
        RedisRunner slave2 = new RedisRunner().port(6901).randomDir().nosave();
        RedisRunner slave3 = new RedisRunner().port(6902).randomDir().nosave();
        RedisRunner slave4 = new RedisRunner().port(6903).randomDir().nosave();
        
        ClusterRunner clusterRunner = new ClusterRunner()
                .addNode(master1, slave1, slave4)
                .addNode(master2, slave2)
                .addNode(master3, slave3);
        ClusterProcesses process = clusterRunner.run();
        
        Thread.sleep(5000); 
        
        Config config = new Config();
        config.useClusterServers()
        .setLoadBalancer(new RandomLoadBalancer())
        .addNodeAddress(process.getNodes().stream().findAny().get().getRedisServerAddressAndPort());
        RedissonClient redisson = Redisson.create(config);
       
        RedisProcess master = process.getNodes().stream().filter(x -> x.getRedisServerPort() == master1.getPort()).findFirst().get();
        
        List<RFuture<?>> futures = new ArrayList<RFuture<?>>();
        CountDownLatch latch = new CountDownLatch(1);
        Thread t = new Thread() {
            public void run() {
                for (int i = 0; i < 2000; i++) {
                    RFuture<?> f1 = redisson.getBucket("i" + i).getAsync();
                    RFuture<?> f2 = redisson.getBucket("i" + i).setAsync("");
                    RFuture<?> f3 = redisson.getTopic("topic").publishAsync("testmsg");
                    futures.add(f1);
                    futures.add(f2);
                    futures.add(f3);
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                    }
                    if (i % 100 == 0) {
                        System.out.println("step: " + i);
                    }
                }
                latch.countDown();
            };
        };
        t.start();
        t.join(1000);

        Set<InetSocketAddress> addresses = new HashSet<>();
        Collection<ClusterNode> masterNodes = redisson.getClusterNodesGroup().getNodes(NodeType.MASTER);
        for (ClusterNode clusterNode : masterNodes) {
            addresses.add(clusterNode.getAddr());
        }
        
        master.stop();
        System.out.println("master " + master.getRedisServerAddressAndPort() + " has been stopped!");
        
        Thread.sleep(TimeUnit.SECONDS.toMillis(80));
        
        RedisProcess newMaster = null;
        Collection<ClusterNode> newMasterNodes = redisson.getClusterNodesGroup().getNodes(NodeType.MASTER);
        for (ClusterNode clusterNode : newMasterNodes) {
            if (!addresses.contains(clusterNode.getAddr())) {
                newMaster = process.getNodes().stream().filter(x -> x.getRedisServerPort() == clusterNode.getAddr().getPort()).findFirst().get();
                break;
            }
            System.out.println("new-master: " + clusterNode.getAddr());
        }
                
        Thread.sleep(50000);
        
        newMaster.stop();

        System.out.println("new master " + newMaster.getRedisServerAddressAndPort() + " has been stopped!");
        
        Thread.sleep(TimeUnit.SECONDS.toMillis(70));
        
        Thread.sleep(60000);

        latch.await();

        int errors = 0;
        int success = 0;
        int readonlyErrors = 0;
        
        for (RFuture<?> rFuture : futures) {
            rFuture.awaitUninterruptibly();
            if (!rFuture.isSuccess()) {
                errors++;
            } else {
                success++;
            }
        }
        
        System.out.println("errors " + errors + " success " + success);

        for (RFuture<?> rFuture : futures) {
            if (rFuture.isSuccess()) {
                System.out.println(rFuture.isSuccess());
            } else {
                rFuture.cause().printStackTrace();
            }
        }
        
        assertThat(readonlyErrors).isZero();
        
        redisson.shutdown();
        process.shutdown();
    }
@mrniko mrniko removed this from the 2.12.3 milestone Jun 28, 2018

@ericwu1 ericwu1 commented Jul 2, 2018

OK, let me test with the 3.7.2 version and get back to you, thanks!


@mrniko mrniko commented Aug 21, 2018

@ericwu1

Any update?


@ericwu1 ericwu1 commented Aug 23, 2018

It seems that after updating to the latest version I stopped seeing this problem. This can be closed out. Thanks

@ericwu1 ericwu1 closed this Aug 23, 2018
@mrniko mrniko added this to the 2.7.2 milestone Aug 23, 2018