
Jedis cluster info rediscovering queueing on high load #1248

Closed
Spikhalskiy opened this issue Apr 1, 2016 · 7 comments

Comments

@Spikhalskiy
Contributor

Jedis currently serializes cluster rediscovery behind a lock. This can cause unpredictable latencies in calls to Jedis plus needless extra load on the Redis nodes. We saw 5-6 second stalls with soTimeout set to 5 ms and a huge number of threads stuck in rediscovery, waiting to acquire the lock.

Expected behavior

If we issue thousands of requests to JedisCluster and 500 of them get MOVED from Redis Cluster within a very short time, we should rediscover the cluster only once.

Actual behavior

It's possible that we rediscover the cluster 500 times while everything else waits.
We saw around 5 seconds of waiting even with Redis on the local network.

Steps to reproduce

Create a fresh cluster and start filling it at very high volume, with a large number of threads from multiple clients, to stimulate slot movement.

Jedis version:

2.8.1 and current master

Redis version:

3.0.0

Code investigation

Let's imagine we have 1000 threads writing with Jedis and 200 of them get MOVED at the same time. All of them will then call JedisClusterCommand#137 this.connectionHandler.renewSlotCache(connection); and end up in JedisClusterInfoCache#discoverClusterSlots. As a result, each of them will in turn acquire the lock and rebuild the cluster info. Meanwhile, threads that try to work with slots that have not even moved will wait to acquire the read lock in JedisClusterInfoCache.
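For reference, a simplified sketch of the current discoverClusterSlots behavior (based on the 2.8.x source; r and w are the read/write locks already present in JedisClusterInfoCache, and this is not the exact code): every redirected thread takes the write lock and repeats the full rebuild, so 200 MOVED responses mean 200 back-to-back rebuilds.

// Simplified sketch of the current behavior, not the exact source.
public void discoverClusterSlots(Jedis jedis) {
  w.lock();                                    // every redirected thread queues here
  try {
    this.slots.clear();                        // each one throws away the slot map...
    List<Object> slots = jedis.clusterSlots(); // ...and asks the node for it again
    for (Object slotInfoObj : slots) {
      // parse slot ranges and assign them to master nodes
    }
  } finally {
    w.unlock();
  }
}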

Proposed solution

  1. Change lock structure in JedisClusterInfoCache#discoverClusterNodesAndSlots and JedisClusterInfoCache#discoverClusterSlots to something like...
import java.util.concurrent.atomic.AtomicBoolean;

// true while one thread holds the write lock and rebuilds the cache
private final AtomicBoolean rebuilding = new AtomicBoolean();

public void discoverClusterSlots(Jedis jedis) {
    if (rebuilding.get()) {
      r.lock(); // just wait until the rebuild by another thread finishes
      r.unlock();
    } else {
      w.lock();
      rebuilding.set(true);
      try {
        // old actual discovering code
      } finally {
        rebuilding.set(false);
        w.unlock();
      }
    }
}
  2. Use information from MOVED and move only one slot; don't make additional external calls to Redis to rediscover the full cluster info under the acquired lock (a rough sketch follows below).
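A rough sketch of item 2, assuming the helpers already in JedisClusterInfoCache (setNodeIfNotExist, assignSlotsToNode) and the slot/target accessors on JedisMovedDataException; the method name renewSlotFromMovedResponse is made up for illustration:

// Hypothetical sketch: reassign only the slot named in the MOVED reply
// instead of rediscovering the whole cluster under the write lock.
public void renewSlotFromMovedResponse(JedisMovedDataException moved) {
  w.lock();
  try {
    HostAndPort targetNode = moved.getTargetNode(); // node the redirect points to
    setNodeIfNotExist(targetNode);                  // make sure we have a pool for it
    assignSlotsToNode(Collections.singletonList(moved.getSlot()), targetNode);
  } finally {
    w.unlock();
  }
}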
@Spikhalskiy Spikhalskiy changed the title Jedis cluster rediscovering queueing on high load Jedis cluster info rediscovering queueing on high load Apr 1, 2016
@marcosnils
Contributor

@Spikhalskiy nice finding. Would you like to craft a PR with the solution you proposed? I believe it's the right way to proceed.

@Spikhalskiy
Contributor Author

@marcosnils Yes, if you think the proposal looks fine, I will prepare a PR.

@marcosnils
Contributor

@Spikhalskiy what do you think about just skipping discovery instead of locking if it's already happening? I believe that's better than locking and waiting until it finishes, even though it shouldn't take long.

This way we can respond to MOVED queries even though discovery is happening.

  private volatile boolean discovering; // true while a rediscovery is in flight

  public void discoverClusterSlots(Jedis jedis) {
    if (!discovering) {
      w.lock();
      discovering = true;
      try {
        this.slots.clear();

        List<Object> slots = jedis.clusterSlots();

        for (Object slotInfoObj : slots) {
          List<Object> slotInfo = (List<Object>) slotInfoObj;

          if (slotInfo.size() <= 2) {
            continue;
          }

          List<Integer> slotNums = getAssignedSlotArray(slotInfo);

          // hostInfos
          List<Object> hostInfos = (List<Object>) slotInfo.get(2);
          if (hostInfos.size() <= 0) {
            continue;
          }

          // at this time, we just use master, discard slave information
          HostAndPort targetNode = generateHostAndPort(hostInfos);

          setNodeIfNotExist(targetNode);
          assignSlotsToNode(slotNums, targetNode);
        }
      } finally {
        discovering = false;
        w.unlock();
      }
    }
  }

This way Jedis can still answer requests even though discovery is taking place. Those requests will be penalized a bit because Jedis will have to redirect to the correct node, but at least we don't block everything until discovery finishes.

@HeartSaVioR ping?

@Spikhalskiy
Contributor Author

@marcosnils Yep, that should be fine, especially because we will acquire the read lock in any case on the next attempt to access a Redis node after getting MOVED on the previous attempt. I also want to make currentRediscoverProcessType an int and store the "discovery type" in this variable, because a #discoverClusterSlots in progress shouldn't cancel discoverClusterNodesAndSlots. Will submit an initial PR shortly.
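A rough illustration of that idea (field name from the comment above; the values and the skip rule are my guesses, not necessarily what the PR will do):

// 0 = idle, 1 = slots-only rediscovery, 2 = full nodes+slots rediscovery
private volatile int currentRediscoverProcessType = 0;

public void discoverClusterNodesAndSlots(Jedis jedis) {
  // skip only if the same *full* rediscovery is already running;
  // a slots-only rediscovery in flight must not suppress this heavier one
  if (currentRediscoverProcessType == 2) {
    return;
  }
  w.lock();
  currentRediscoverProcessType = 2;
  try {
    // existing nodes + slots discovery code
  } finally {
    currentRediscoverProcessType = 0;
    w.unlock();
  }
}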

@Spikhalskiy
Contributor Author

The first item from the proposal is implemented to fix the problem ASAP. The second item, about not rediscovering at all on a single MOVED, could be done as a separate PR.

@marcosnils
Contributor

  2. Use information from MOVED and move only one slot; don't make additional external calls to Redis to rediscover the full cluster info under the acquired lock.

@Spikhalskiy I don't believe this makes much sense. The Redis Cluster spec states that when a MOVED occurs a complete cluster rediscovery should be performed, because it's very unlikely for a node to move just one slot to a different server.

Slots are usually moved for two main reasons:

  1. You want to add new nodes to your cluster, which requires resharding.
  2. Some node went down and you need to move its slots somewhere else.

I believe we can close this issue now.

thoughts?

@Spikhalskiy
Contributor Author

Ok, especially since it's the behavior recommended by the spec, let's leave it as it's implemented now. Thanks!
