[CRASH] Assertion Failed when running rebalance command when upgrading from 7.0.11 to 7.2.2 #12695
Comments
Any updates on this? |
@salarali thanks for the report, I am taking a look. Although it's a bit convoluted, I found a way to reproduce it. |
@PingXie since you are here, can you also take a look? This is somewhat like #12805: if the node is a master, we may need to add it to the shard list.
The reason for the issue: if we have A (7.2) -> B (7.0), where B is A's master in node A's view, then A does not know B's shard id, so here we are not able to clear B's slot_info.
Instead, we keep growing B's slot info on every call, eventually hitting the assert:
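The assert in question is the slot-info bounds check in clusterGenNodesSlotsInfo (quoted from the commit message further down in this thread):

```c
/* In clusterGenNodesSlotsInfo(): the buffer is sized once for
 * 2 * numslots entries, but every CLUSTER SHARDS call appends another
 * (start, end) pair without the old pairs being freed, so the count
 * eventually overruns the bound and the assert fires. */
if (!n->slot_info_pairs) {
    n->slot_info_pairs = zmalloc(2 * n->numslots * sizeof(uint16_t));
}
serverAssert((n->slot_info_pairs_count + 1) < (2 * n->numslots));
n->slot_info_pairs[n->slot_info_pairs_count++] = start;
n->slot_info_pairs[n->slot_info_pairs_count++] = i-1;
```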
|
@enjoy-binbin, it looks like your fix for #12805 might resolve this issue too. With that fix in place, every 7.2 node (like

Btw, is re-sharding required to trigger this bug? Generally, I'd lean towards updating all nodes to the same version before doing something as involved as re-sharding. Most folks update the whole cluster first, which is a good call: it keeps things straightforward and avoids the quirks you might run into with a mixed-version setup. |
The fix in #12805 won't help, since we first check whether the sender's shard_id has changed; only if it changed do we add it to the shard list. In this case, if the sender is a master, the shard_id does not change, so we are not able to add it to the shard list. |
When a 7.2 node

What are the repro steps? Or is it possible for you to share the core dump somehow? It is a bit hard to be certain just by looking at the source code. |
My repro steps:
Repeating steps C and D, CLUSTER SHARDS responds with the following (we can see the slots section keeps expanding):
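The output looked like the following (the same fragment is quoted in the commit message below), with the 0-5461 range appended once more on every call:

```
1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5461
      3) (integer) 0
      4) (integer) 5461
      5) (integer) 0
```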
The steps to reproduce are quite messy; I upgraded nodes randomly locally. Yeah, with the fix for #12805, the replica will remain in the same shard. However, the shard id of a certain master is never added to the shard id dict, so maybe there is something missing somewhere.

```c
/* The code here checks memcmp, and since the sender's shard id has not
 * changed, we won't add it to the shard id dict. */
static void updateShardId(clusterNode *node, const char *shard_id) {
if (shard_id && memcmp(node->shard_id, shard_id, CLUSTER_NAMELEN) != 0) {
clusterRemoveNodeFromShard(node);
memcpy(node->shard_id, shard_id, CLUSTER_NAMELEN);
clusterAddNodeToShard(shard_id, node);
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);
}
if (shard_id && myself != node && myself->slaveof == node) {
if (memcmp(myself->shard_id, shard_id, CLUSTER_NAMELEN) != 0) {
/* shard-id can diverge right after a rolling upgrade
* from pre-7.2 releases */
clusterRemoveNodeFromShard(myself);
memcpy(myself->shard_id, shard_id, CLUSTER_NAMELEN);
clusterAddNodeToShard(shard_id, myself);
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
}
}
}
```

That's why I plan to add something like this:

```c
if (ext_shardid == NULL && nodeIsMaster(sender))
    clusterAddNodeToShard(sender->shard_id, sender);
```

So the essential reason is that in the shard id dict (that is, server.cluster->shards), #12805 will only add the replica, but not the master, which causes this problem. |
Got it, this sounds more like a case where a 7.2 node is just observing a 7.0 shard, not actually replicating from it. Makes me wonder if we even need to go through re-sharding to reproduce this bug?
Did you get a chance to try this out with your latest changes in #12805? The earlier commits had this issue, but I thought your last update would've fixed it for both v7.0 primary and replica nodes. |
I feel it's not needed; I just followed the issue's idea, and then I got an environment that reproduces it stably and didn't want to break it.
I did try it (even earlier); it doesn't work. The new commit indeed will call |
You are right. Now I see two potential issues with
Would you like to propose a change? |
I am happy to make the change and test it, but I'm a little confused. Do you have any ideas, or can you elaborate a bit more? |
Thinking about this more, a better and more correct fix would be to update the shard topology when a 7.0 replica is connected to its 7.0 primary for the very first time. More specifically, we need to inject a |
I tried it; it didn't work. Luckily, I found minimal steps to reproduce:
So the reason is that when a 7.2 node is a 7.0 node's slave, the shard id dict does not have the 7.0 node, so when we issue CLUSTER SHARDS on the 7.2 node, it ends up here: #12695 (comment) |
This looks like a different issue from the one observed in your previous #12695 (comment). I think we should still keep the change I proposed in #12695 (comment). We could continue with the fix proposed in #12695 (comment), but an alternative could be fixing the underlying assumption of |
Wow, that's actually what I thought at first: the shard dict should always be in sync with the node's shard_id. The first time I tried it, it didn't fix the issue, so I dropped it and considered fixing it with a smaller diff. I actually agree with this idea. We don't update the shard dict synchronously in some places, which feels like a hidden danger. I will try to open a new PR later with all the changes we mentioned (for better review). |
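For illustration, one hypothetical shape of that invariant (a sketch only, not the final patch, which landed in #12832): make updateShardId re-register the node in server.cluster->shards unconditionally, rather than only when memcmp detects a changed shard id:

```c
/* Hypothetical sketch: keep server.cluster->shards in sync with
 * node->shard_id even when the id itself did not change. This relies on
 * clusterAddNodeToShard() being a no-op when the node is already in the
 * shard's node list (an assumption about that helper). The replica
 * (myself) branch of the original function is omitted for brevity. */
static void updateShardId(clusterNode *node, const char *shard_id) {
    if (shard_id == NULL) return;
    if (memcmp(node->shard_id, shard_id, CLUSTER_NAMELEN) != 0) {
        clusterRemoveNodeFromShard(node);
        memcpy(node->shard_id, shard_id, CLUSTER_NAMELEN);
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);
    }
    /* Re-add unconditionally, so a master that was never registered
     * (e.g. learned from a pre-7.2 node) still ends up in the dict. */
    clusterAddNodeToShard(node->shard_id, node);
}
```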
…e not sync

Crash reported in redis#12695.

In the process of upgrading the cluster from 7.0 to 7.2, because the 7.0 nodes will not gossip shard id, in 7.2 we rely on shard id to build the server.cluster->shards dict.

In some cases, for example with a 7.0 master node and a 7.2 replica node, from the view of the 7.2 replica node the cluster->shards dictionary does not have its master node. In this case, calling CLUSTER SHARDS on the 7.2 replica node may crash.

A CLUSTER SHARDS result output:
```
1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5461
      3) (integer) 0
      4) (integer) 5461
      5) (integer) 0
```

We can see that the output contains repeated slots; each call appends a new pair, and then we crash on serverAssert:
```c
void clusterGenNodesSlotsInfo(int filter) {
    ...
    /* Generate slots info when occur different node with start
     * or end of slot. */
    if (i == CLUSTER_SLOTS || n != server.cluster->slots[i]) {
        if (!(n->flags & filter)) {
            if (!n->slot_info_pairs) {
                n->slot_info_pairs = zmalloc(2 * n->numslots * sizeof(uint16_t));
            }
            serverAssert((n->slot_info_pairs_count + 1) < (2 * n->numslots));
            n->slot_info_pairs[n->slot_info_pairs_count++] = start;
            n->slot_info_pairs[n->slot_info_pairs_count++] = i-1;
        }
        if (i == CLUSTER_SLOTS) break;
        n = server.cluster->slots[i];
        start = i;
    }
    ...
}
```

The reason is that in addShardReplyForClusterShards we are not able to clean up the slot_info_pairs corresponding to the 7.0 master node. In the code below, we loop to find the 7.0 master node, and then we call clusterFreeNodesSlotsInfo to clean up slot_info_pairs according to the shard id dict list, but the 7.0 master node is not in the list.
```c
void addShardReplyForClusterShards(client *c, list *nodes) {
    ...
    /* Use slot_info_pairs from the primary only */
    while (n->slaveof != NULL) n = n->slaveof;
    ...
    addReplyBulkCString(c, "nodes");
    addReplyArrayLen(c, listLength(nodes));
    listIter li;
    listRewind(nodes, &li);
    for (listNode *ln = listNext(&li); ln != NULL; ln = listNext(&li)) {
        clusterNode *n = listNodeValue(ln);
        addNodeDetailsToShardReply(c, n);
        clusterFreeNodesSlotsInfo(n);
    }
}
```

We should fix the underlying assumption of updateShardId, which is that the shard dict should always be in sync with the node's shard_id. The fix was suggested by PingXie; see more details in redis#12695.

Co-authored-by: Ping Xie <pingxie@google.com>
@PingXie thanks! I verified that adding it in clusterRenameNode fixes this issue (#12695 (comment)). Please take a look at the fix in #12832. |
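For context, a rough sketch of what that change could look like (a hypothetical shape; see #12832 for the actual patch): after a handshake node is renamed to its real name, also register it in the shards dict under its shard id, so CLUSTER SHARDS can later find and free its slot_info_pairs:

```c
/* Sketch based on the discussion; see #12832 for the real fix. */
void clusterRenameNode(clusterNode *node, char *newname) {
    int retval;
    sds s = sdsnewlen(node->name, CLUSTER_NAMELEN);

    serverLog(LL_DEBUG, "Renaming node %.40s into %.40s",
              node->name, newname);
    retval = dictDelete(server.cluster->nodes, s);
    sdsfree(s);
    serverAssert(retval == DICT_OK);
    memcpy(node->name, newname, CLUSTER_NAMELEN);
    clusterAddNode(node);
    /* New: track the renamed node in the shards dict, keeping
     * server.cluster->shards in sync with node->shard_id. */
    clusterAddNodeToShard(node->shard_id, node);
}
```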
…e not sync (#12832)

Crash reported in #12695. In the process of upgrading the cluster from 7.0 to 7.2, because the 7.0 nodes will not gossip shard id, in 7.2 we rely on shard id to build the server.cluster->shards dict.

In some cases, for example with a 7.0 master node and a 7.2 replica node, from the view of the 7.2 replica node the cluster->shards dictionary does not have its master node. In this case, calling CLUSTER SHARDS on the 7.2 replica node may crash.

We should fix the underlying assumption of updateShardId, which is that the shard dict should always be in sync with the node's shard_id. The fix was suggested by PingXie; see more details in #12695.
Crash report
The "Assertion failed" error keeps recurring over a long period of time.
Additional information
Amazon Linux 2023, redis-version 7.2.2
The rebalance command gets stuck, and on investigation, this assertion failure was found.