Redis cluster rebalance fails when trying to shard away from large number of nodes #4592

mjh1 · 2018-01-10T13:43:51Z

redis version 3.2.10, 4.0.6
redis gem version 3.3.3

I'm seeing the rebalance command fail when moving slots away from nodes in a 60-node Redis cluster (no replication). The error is:

...
Moving 95 slots from 172.31.5.105:6379 to 172.31.10.118:6379
###############################################################################################
Moving 95 slots from 172.31.5.105:6379 to 172.31.6.80:6379
###############################################################################################
Moving 61 slots from 172.31.5.105:6379 to 172.31.0.211:6379
############################################################stderr: /usr/local/rvm/gems/ruby-2.4.1/gems/redis-3.3.3/lib/redis/client.rb:121:in `call': ERR Please use SETSLOT only with masters. (Redis::CommandError)
	from /usr/local/rvm/gems/ruby-2.4.1/gems/redis-3.3.3/lib/redis.rb:2705:in `block in method_missing'
	from /usr/local/rvm/gems/ruby-2.4.1/gems/redis-3.3.3/lib/redis.rb:58:in `block in synchronize'
	from /usr/local/rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
	from /usr/local/rvm/gems/ruby-2.4.1/gems/redis-3.3.3/lib/redis.rb:58:in `synchronize'
	from /usr/local/rvm/gems/ruby-2.4.1/gems/redis-3.3.3/lib/redis.rb:2704:in `method_missing'

stderr: 	from /usr/local/src/redis-3.2.10/src/redis-trib.rb:958:in `block in move_slot'
	from /usr/local/src/redis-3.2.10/src/redis-trib.rb:956:in `each'
	from /usr/local/src/redis-3.2.10/src/redis-trib.rb:956:in `move_slot'
	from /usr/local/src/redis-3.2.10/src/redis-trib.rb:1115:in `block in rebalance_cluster_cmd'
	from /usr/local/src/redis-3.2.10/src/redis-trib.rb:1114:in `each'
	from /usr/local/src/redis-3.2.10/src/redis-trib.rb:1114:in `rebalance_cluster_cmd'
	from /usr/local/src/redis-3.2.10/src/redis-trib.rb:1701:in `<main>'

For some reason one of the nodes gets converted to a slave, so SETSLOT fails. I then have to remove the slave node and try again.

I'm using the redis-trib rebalance command to move hash slots away from nodes in order to scale down. For example, to scale down by 3 nodes, I pick 3 nodes A, B and C and run:

redis-trib.rb rebalance --pipeline 1000 --timeout 60000 --weight A_node_id=0 --weight B_node_id=0 --weight C_node_id=0

I've only seen this happen with larger clusters (50+ nodes), and it seems more likely the more nodes are assigned weight zero, but I have still seen it happen when removing 10 nodes from a 60-node cluster.

The workaround seems to be to try again with a smaller number of nodes set to weight zero.
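
For context on what a weight of zero asks the tool to do, here is a minimal sketch of the weight arithmetic. It assumes that rebalance gives each node a share of the 16384 hash slots proportional to its weight; the helper name slot_balance, the node names and the slot counts are made up for illustration, and this is not the redis-trib.rb source.

```ruby
# Sketch only: how per-node weights could translate into slot targets.
# Assumption: a node's target is its weight's share of the 16384 hash slots.
CLUSTER_HASH_SLOTS = 16384

# current: { node_name => number of slots currently owned }
# weights: { node_name => weight }; nodes not listed default to weight 1
def slot_balance(current, weights = {})
  w = current.keys.map { |name| [name, weights.fetch(name, 1).to_f] }.to_h
  total_weight = w.values.sum
  current.map do |name, owned|
    expected = (CLUSTER_HASH_SLOTS / total_weight * w[name]).to_i
    # Positive balance: the node must give slots away; negative: it receives.
    [name, owned - expected]
  end.to_h
end

# Hypothetical 3-node cluster; node C is being drained with weight 0.
balance = slot_balance({ "A" => 5462, "B" => 5461, "C" => 5461 }, { "C" => 0 })
balance.each { |name, delta| puts "#{name}: #{delta}" }
```

Under these assumptions a weight-zero node has a target of zero slots, so every slot it owns is scheduled to move, and it ends the run as an empty master.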

fuyuanpai · 2018-07-02T08:49:51Z

I have exactly the same problem.
Once all of a master's slots have been moved, that master becomes a slave, which causes the SETSLOT failure.
Is there any progress on this issue?

fuyuanpai · 2018-07-02T09:57:37Z

Same issue as #3083

@antirez
This issue does not seem to be fixed.

antirez · 2018-07-02T16:40:06Z

Ping @artix75

nhuffman-brightcove · 2018-09-05T16:38:40Z

Reproduced during removal of a single node from a 3-node cluster (no replication):

cluster nodes

b531b6a1036e0766ec9bddaea1e4ed2db527b2b3 172.31.7.148:6379@16379 master - 0 1535601698109 1466 connected 0-2730 5263-5465 5544-5621 5700-5764 5808-8192 10924-12288 12563-13108 14321-14563 14642-15217
88ad38e7be439484f6b319c3044db8199469e08b 172.31.47.12:6379@16379 myself,master - 0 1535515481000 1465 connected 2731-5262 5466-5543 5622-5699 5765-5807 8193-10923 12289-12562 13109-14320 14564-14641 15218-16383

/usr/local/src/redis-4.0.6/src/redis-trib.rb add-node 172.31.30.67:6379 172.31.47.12:6379

>>> Adding node 172.31.30.67:6379 to cluster 172.31.47.12:6379
>>> Performing Cluster Check (using node 172.31.47.12:6379)
M: 88ad38e7be439484f6b319c3044db8199469e08b 172.31.47.12:6379
slots:2731-5262,5466-5543,5622-5699,5765-5807,8193-10923,12289-12562,13109-14320,14564-14641,15218-16383 (8192 slots) master
0 additional replica(s)
M: b531b6a1036e0766ec9bddaea1e4ed2db527b2b3 172.31.7.148:6379
slots:0-2730,5263-5465,5544-5621,5700-5764,5808-8192,10924-12288,12563-13108,14321-14563,14642-15217 (8192 slots) master
0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 172.31.30.67:6379 to make it join the cluster.
[OK] New node added correctly.

add-node exited with code 0

/usr/local/src/redis-4.0.6/src/redis-trib.rb rebalance --use-empty-masters --pipeline 1000 --timeout 60000 172.31.47.12:6379

>>> Performing Cluster Check (using node 172.31.47.12:6379)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Rebalancing across 3 nodes. Total weight = 3
Moving 2731 slots from 172.31.7.148:6379 to 172.31.30.67:6379
################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################
###########################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################
Moving 2731 slots from 172.31.47.12:6379 to 172.31.30.67:6379
################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################
###########################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################

rebalance exited with code 0

cluster nodes

cec9f1ba823bb6f7654ccb05709dcda6c9d4fa56 172.31.30.67:6379@16379 master - 0 1535601980498 1467 connected 0-5262 5466-5543 5622-5699 5765-5807
88ad38e7be439484f6b319c3044db8199469e08b 172.31.47.12:6379@16379 master - 0 1535601979498 1465 connected 8193-10923 12289-12562 13109-14320 14564-14641 15218-16383
b531b6a1036e0766ec9bddaea1e4ed2db527b2b3 172.31.7.148:6379@16379 myself,master - 0 1535601978000 1466 connected 5263-5465 5544-5621 5700-5764 5808-8192 10924-12288 12563-13108 14321-14563 14642-15217

/usr/local/src/redis-4.0.6/src/redis-trib.rb rebalance --use-empty-masters --pipeline 1000 --timeout 60000 --weight 88ad38e7be439484f6b319c3044db8199469e08b=0 172.31.30.67:637

>>> Performing Cluster Check (using node 172.31.30.67:6379)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Rebalancing across 3 nodes. Total weight = 2.0
Moving 2731 slots from 172.31.47.12:6379 to 172.31.7.148:6379
################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################
###########################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################
Moving 2730 slots from 172.31.47.12:6379 to 172.31.30.67:6379
################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################
#########################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################stderr: /usr/local/share/ruby/gems/2.0/gems/redis-3.3.3/lib/redis/client.rb:121:in `call': ERR Please use SETSLOT only with masters. (Redis::CommandError)
from /usr/local/share/ruby/gems/2.0/gems/redis-3.3.3/lib/redis.rb:2705:in `block in method_missing'
from /usr/local/share/ruby/gems/2.0/gems/redis-3.3.3/lib/redis.rb:58:in `block in synchronize'
from /usr/share/ruby/2.0/monitor.rb:211:in `mon_synchronize'

stderr: from /usr/local/share/ruby/gems/2.0/gems/redis-3.3.3/lib/redis.rb:58:in `synchronize'
from /usr/local/share/ruby/gems/2.0/gems/redis-3.3.3/lib/redis.rb:2704:in `method_missing'
from /usr/local/src/redis-4.0.6/src/redis-trib.rb:958:in `block in move_slot'
from /usr/local/src/redis-4.0.6/src/redis-trib.rb:956:in `each'
from /usr/local/src/redis-4.0.6/src/redis-trib.rb:956:in `move_slot'
from /usr/local/src/redis-4.0.6/src/redis-trib.rb:1115:in `block in rebalance_cluster_cmd'
from /usr/local/src/redis-4.0.6/src/redis-trib.rb:1114:in `each'
from /usr/local/src/redis-4.0.6/src/redis-trib.rb:1114:in `rebalance_cluster_cmd'
from /usr/local/src/redis-4.0.6/src/redis-trib.rb:1700:in `<main>'

rebalance exited with code 1 after error during move of final slot.

cluster nodes

cec9f1ba823bb6f7654ccb05709dcda6c9d4fa56 172.31.30.67:6379@16379 master - 0 1535602030000 1469 connected 0-5262 5466-5543 5622-5699 5765-5807 12289-12562 13109-14320 14564-14641 15218-16383
88ad38e7be439484f6b319c3044db8199469e08b 172.31.47.12:6379@16379 slave cec9f1ba823bb6f7654ccb05709dcda6c9d4fa56 0 1535602030591 1469 connected
b531b6a1036e0766ec9bddaea1e4ed2db527b2b3 172.31.7.148:6379@16379 myself,master - 0 1535602028000 1468 connected 5263-5465 5544-5621 5700-5764 5808-12288 12563-13108 14321-14563 14642-15217
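
The final state above can also be checked mechanically. The following is a minimal sketch that parses the CLUSTER NODES text and reports each node's role and slot count. It assumes the usual field order of that reply (id, address, flags, master id, ping-sent, pong-recv, config-epoch, link-state, slot ranges); parse_cluster_nodes is a hypothetical helper, not something redis-trib.rb provides.

```ruby
# Sketch: summarise CLUSTER NODES output so that an emptied node which has
# been turned into a slave is easy to spot.
def parse_cluster_nodes(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    id, addr, flags, master, _ping, _pong, _epoch, _link, *ranges = line.split
    slots = ranges.sum do |r|
      next 0 if r.start_with?("[")   # skip migrating/importing markers
      first, last = r.split("-").map(&:to_i)
      (last || first) - first + 1    # a single slot has no "-last" part
    end
    { id: id, addr: addr, flags: flags.split(","),
      master: (master == "-" ? nil : master), slots: slots }
  end
end

# The CLUSTER NODES output captured after the failed rebalance.
output = <<~NODES
  cec9f1ba823bb6f7654ccb05709dcda6c9d4fa56 172.31.30.67:6379@16379 master - 0 1535602030000 1469 connected 0-5262 5466-5543 5622-5699 5765-5807 12289-12562 13109-14320 14564-14641 15218-16383
  88ad38e7be439484f6b319c3044db8199469e08b 172.31.47.12:6379@16379 slave cec9f1ba823bb6f7654ccb05709dcda6c9d4fa56 0 1535602030591 1469 connected
  b531b6a1036e0766ec9bddaea1e4ed2db527b2b3 172.31.7.148:6379@16379 myself,master - 0 1535602028000 1468 connected 5263-5465 5544-5621 5700-5764 5808-12288 12563-13108 14321-14563 14642-15217
NODES

parse_cluster_nodes(output).each do |n|
  role = n[:flags].include?("slave") ? "slave of #{n[:master]}" : "master"
  puts "#{n[:addr]} #{role} slots=#{n[:slots]}"
end
```

Run against the output above, this shows that all slots are still covered by the two remaining masters and that the drained node owns none, which suggests the slot moves themselves completed and only a later step failed.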

nhuffman-brightcove · 2018-09-24T17:08:51Z

We were able to work around this issue by modifying redis-trib.rb to reload node info during move_slot. With this modification, the node whose slots are all removed is still converted from a master to a slave, but the rebalance no longer fails with the "Please use SETSLOT only with masters" error. The change: add n.load_info above line https://github.com/antirez/redis/blob/4.0/src/redis-trib.rb#L1087

That said, I'm not sure how or why this works. With this change, all slots are removed from the node, it is converted to a slave, and it can safely be removed from the cluster and terminated. But if the node doesn't change from a master to a slave until it has no slots assigned, then move_slot should not be getting called to move a slot away from it. And if there is still a slot to move, and we just skip moving it because the node has turned into a slave, then in theory the slot should still be assigned to the master-turned-slave, which it is not.
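
One possible explanation, not confirmed against the redis-trib.rb source: the traces above show the error raised inside an `each` loop in move_slot, which would fit a step that announces the new slot owner with SETSLOT to every node the tool believes is a master. If the master/slave flag is cached from the start of the run, a node demoted after losing its last slot would still receive SETSLOT and reject it. The sketch below simulates that with stand-in objects and makes no Redis connection; FakeNode, announce_owner and has_flag? are hypothetical names, and only load_info is taken from the workaround above.

```ruby
# Sketch with stand-in objects; no Redis connection is made.
# FakeNode mimics a node handle that caches its flags until load_info is called.
class FakeNode
  attr_reader :name, :received

  def initialize(name, live_flags)
    @name = name
    @live_flags = live_flags        # what the server would report right now
    @cached_flags = live_flags.dup  # what the tool remembers from startup
    @received = []                  # SETSLOT calls this node accepted
  end

  # Simulates the server turning an emptied master into a slave.
  def demote!
    @live_flags = ["slave"]
  end

  # Refreshes the cached view, as the n.load_info workaround does.
  def load_info
    @cached_flags = @live_flags.dup
  end

  def has_flag?(flag)
    @cached_flags.include?(flag)
  end

  # Mirrors the server-side check that produced the error in this issue.
  def setslot(slot, owner)
    raise "ERR Please use SETSLOT only with masters." if @live_flags.include?("slave")
    @received << [slot, owner]
  end
end

# Announce the new owner of a slot to every node still believed to be a master.
def announce_owner(nodes, slot, owner, refresh:)
  nodes.each do |n|
    n.load_info if refresh
    next if n.has_flag?("slave")
    n.setslot(slot, owner)
  end
end

nodes = [FakeNode.new("A", ["master"]), FakeNode.new("B", ["master"])]
nodes[0].demote!  # A lost its last slot and was turned into a slave

begin
  announce_owner(nodes, 16383, "B", refresh: false)
rescue RuntimeError => e
  puts "stale flags: #{e.message}"
end

announce_owner(nodes, 16383, "B", refresh: true)
puts "refreshed flags: B received #{nodes[1].received.inspect}"
```

If this is what happens, the slot data has already been migrated by the time the announcement fails, which would explain why refreshing the flags lets the run finish without leaving any slot on the master-turned-slave.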

nhuffman-brightcove mentioned this issue Jul 6, 2018

'redis-trib fix' results in persistent 'Nodes don't agree about configuration' when both master and slave go down with allocated slots #4375

Open

nhuffman-brightcove mentioned this issue Sep 6, 2018

Add cluster-allow-replica-migration option. #5285

Merged

funny-falcon mentioned this issue Sep 7, 2018

Please, do not automatically reassing slave when master became empty. #4896

Closed

funny-falcon mentioned this issue Sep 24, 2018

Add test for fixing failed slot migration, and fix fixing tool. #5270

Closed

tsein-bc added a commit to tsein-bc/redis that referenced this issue Oct 9, 2018

Fixing issue redis#4592 by loading node info before running move_slot, which will crash if the node in question has become a slave (78e1cb4)

tsein-bc mentioned this issue Oct 9, 2018

Fixing issue #4592 by loading node info before running move_slot, whi… #5432

Open
