# Jedis cluster info rediscovering queueing on high load #1248
@Spikhalskiy nice finding. Would you like to craft a PR with the solution you proposed? I believe it's the right way to proceed.
@marcosnils Yes, if you think the proposal looks fine, I will prepare a PR.
@Spikhalskiy what do you think if we just skip discovering, instead of locking, when it's already happening? I believe that's better than locking and waiting until it finishes, even though it shouldn't take a lot of time. This way we can respond to MOVED queries even while discovery is happening:

```java
public void discoverClusterSlots(Jedis jedis) {
  // Best-effort guard: `discovering` should be volatile so other threads
  // see the flag promptly. Two threads may still race past this check;
  // the loser simply waits on the write lock as before.
  if (!discovering) {
    w.lock();
    discovering = true;
    try {
      this.slots.clear();
      List<Object> slots = jedis.clusterSlots();
      for (Object slotInfoObj : slots) {
        List<Object> slotInfo = (List<Object>) slotInfoObj;
        if (slotInfo.size() <= 2) {
          continue;
        }
        List<Integer> slotNums = getAssignedSlotArray(slotInfo);
        // hostInfos
        List<Object> hostInfos = (List<Object>) slotInfo.get(2);
        if (hostInfos.isEmpty()) {
          continue;
        }
        // at this time, we just use master; discard slave information
        HostAndPort targetNode = generateHostAndPort(hostInfos);
        setNodeIfNotExist(targetNode);
        assignSlotsToNode(slotNums, targetNode);
      }
    } finally {
      discovering = false;
      w.unlock();
    }
  }
}
```

This way Jedis can still answer requests even while discovery is taking place. Those requests will be penalized a bit because Jedis will have to redirect to the correct node, but at least we don't block everything until discovery finishes.

@HeartSaVioR ping?
@marcosnils Yep, that should be fine, especially because we will acquire the read lock in any case on the next attempt to access a Redis node after getting MOVED on the previous one. Also, I want to make currentRediscoverProcessType an int and record the "discovering type" in this variable, because an in-progress #discoverClusterSlots shouldn't cancel discoverClusterNodesAndSlots. Will submit an initial PR shortly.
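The "discovering type" idea above could be sketched roughly as follows. This is a minimal illustration only, assuming invented names (`RediscoverGuard`, `tryStart`, the type constants), not the actual Jedis fields or the eventual PR: a lighter slots-only rediscovery skips when any rediscovery is running, while a full nodes-and-slots rediscovery is allowed to supersede a slots-only one.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: track which kind of rediscovery is in flight so a
// cheaper discoverClusterSlots does not suppress a full
// discoverClusterNodesAndSlots. Names are illustrative, not Jedis API.
class RediscoverGuard {
  static final int IDLE = 0;
  static final int SLOTS_ONLY = 1;      // discoverClusterSlots
  static final int NODES_AND_SLOTS = 2; // discoverClusterNodesAndSlots

  private final AtomicInteger current = new AtomicInteger(IDLE);

  // Returns false (caller should skip) when an equal-or-stronger
  // rediscovery is already running; otherwise claims the slot.
  boolean tryStart(int type) {
    while (true) {
      int running = current.get();
      if (running >= type) {
        return false; // already covered by the running rediscovery
      }
      if (current.compareAndSet(running, type)) {
        return true;
      }
    }
  }

  void finish() {
    current.set(IDLE);
  }
}
```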
First item from the proposal implemented, to fix the problem ASAP. The second item, about not rediscovering at all on a single MOVED, could be done as a separate PR.
@Spikhalskiy I don't believe this makes much sense. The Redis Cluster spec states that when a MOVED occurs, a complete cluster rediscovery must be performed, because it's very unlikely for a node to move just one slot to a different server. Slots are usually moved for two main reasons.
I believe we can close this issue now. Thoughts?
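For reference, the MOVED redirection discussed above has the textual shape described in the Redis Cluster spec (`MOVED <slot> <host>:<port>`). A small parsing sketch, with a hypothetical class name that is not part of the Jedis API:

```java
// Illustrative only: split a MOVED error reply into its parts.
// Per the Redis Cluster spec, a MOVED reply means the slot has been
// permanently reassigned, so the client should refresh its whole slot
// map rather than patch the single slot.
class MovedReplyParser {
  // e.g. "MOVED 3999 127.0.0.1:6381" -> { "3999", "127.0.0.1", "6381" }
  static String[] parse(String error) {
    String[] parts = error.split(" ");
    if (parts.length != 3 || !"MOVED".equals(parts[0])) {
      throw new IllegalArgumentException("not a MOVED reply: " + error);
    }
    String[] hostPort = parts[2].split(":");
    return new String[] { parts[1], hostPort[0], hostPort[1] };
  }
}
```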
Ok, especially since it's the behavior recommended by the spec, let's leave it as implemented now. Thanks!
Jedis currently serializes cluster rediscovery behind a lock. This can cause unpredictable latencies in calls to Jedis, plus useless extra load on the Redis nodes. We observed 5-6 second stalls with a soTimeout of 5 ms and a huge number of threads stuck rediscovering and acquiring the lock.
#### Expected behavior
If we have thousands of requests to JedisCluster and we get MOVED from the Redis Cluster on 500 of them within a very short time, we should rediscover the cluster only once.
#### Actual behavior
It's possible that we will rediscover the cluster 500 times while everything else waits. We observed around 5 seconds of waiting with Redis on the local network.
#### Steps to reproduce
Perhaps: create a fresh cluster and start filling it at very high volume, from a large number of threads on multiple clients, to stimulate slot moving.
#### Jedis version
2.8.1 and current master
#### Redis version
3.0.0
#### Code investigation
Let's imagine we have 1000 threads writing with Jedis and 200 of them get MOVED at the same time. All of them will then call (JedisClusterCommand, line 137):
```java
this.connectionHandler.renewSlotCache(connection);
```
and all of them end up in JedisClusterInfoCache#discoverClusterSlots. As a result, each of them will, one by one, acquire the write lock and rebuild the cluster info. Meanwhile, threads touching even non-moved slots will block acquiring the read lock in JedisClusterInfoCache.
#### Proposed solution