[CRASH] Assertion Failed when running rebalance command when upgrading from 7.0.11 to 7.2.2 #12695

Closed
salarali opened this issue Oct 25, 2023 · 16 comments · Fixed by #12832

Comments

@salarali

salarali commented Oct 25, 2023

Crash report

=== REDIS BUG REPORT START: Cut & paste starting from here ===
12758:M 27 Oct 2023 04:19:55.632 # === ASSERTION FAILED ===
12758:M 27 Oct 2023 04:19:55.632 # ==> cluster.c:5349 '(n->slot_info_pairs_count + 1) < (2 * n->numslots)' is not true

------ STACK TRACE ------

Backtrace:
/usr/local/bin/redis-server *:6379 [cluster](slowlogInit+0x0)[0x4fcf40]
/usr/local/bin/redis-server *:6379 [cluster](clusterReplyShards+0x40)[0x4fe880]
/usr/local/bin/redis-server *:6379 [cluster](call+0x14c)[0x46e3ac]
/usr/local/bin/redis-server *:6379 [cluster](processCommand+0x37c)[0x46ed3c]
/usr/local/bin/redis-server *:6379 [cluster](processInputBuffer+0xdc)[0x48dc3c]
/usr/local/bin/redis-server *:6379 [cluster](readQueryFromClient+0x2e8)[0x48e0e8]
/usr/local/bin/redis-server *:6379 [cluster][0x578388]
/usr/local/bin/redis-server *:6379 [cluster](aeMain+0x108)[0x46528c]
/usr/local/bin/redis-server *:6379 [cluster](main+0x3a0)[0x45a7c0]
/lib64/libc.so.6(+0x35a78)[0xffff97887a78]
/lib64/libc.so.6(__libc_start_main+0x9c)[0xffff97887b5c]
/usr/local/bin/redis-server *:6379 [cluster](_start+0x30)[0x45afb0]

------ INFO OUTPUT ------
12758:M 27 Oct 2023 04:19:55.633 # === ASSERTION FAILED ===
12758:M 27 Oct 2023 04:19:55.633 # ==> cluster.c:5349 '(n->slot_info_pairs_count + 1) < (2 * n->numslots)' is not true

------ STACK TRACE ------

Backtrace:
/usr/local/bin/redis-server *:6379 [cluster](slowlogInit+0x0)[0x4fcf40]
/usr/local/bin/redis-server *:6379 [cluster](clusterGenNodesDescription+0x58)[0x4fcff8]
/usr/local/bin/redis-server *:6379 [cluster](logServerInfo+0x260)[0x4ea180]
/usr/local/bin/redis-server *:6379 [cluster](printCrashReport+0x18)[0x4ea718]
/usr/local/bin/redis-server *:6379 [cluster](_serverAssert+0x154)[0x4ea954]
/usr/local/bin/redis-server *:6379 [cluster](slowlogInit+0x0)[0x4fcf40]
/usr/local/bin/redis-server *:6379 [cluster](clusterReplyShards+0x40)[0x4fe880]
/usr/local/bin/redis-server *:6379 [cluster](call+0x14c)[0x46e3ac]
/usr/local/bin/redis-server *:6379 [cluster](processCommand+0x37c)[0x46ed3c]
/usr/local/bin/redis-server *:6379 [cluster](processInputBuffer+0xdc)[0x48dc3c]
/usr/local/bin/redis-server *:6379 [cluster](readQueryFromClient+0x2e8)[0x48e0e8]
/usr/local/bin/redis-server *:6379 [cluster][0x578388]
/usr/local/bin/redis-server *:6379 [cluster](aeMain+0x108)[0x46528c]
/usr/local/bin/redis-server *:6379 [cluster](main+0x3a0)[0x45a7c0]
/lib64/libc.so.6(+0x35a78)[0xffff97887a78]
/lib64/libc.so.6(__libc_start_main+0x9c)[0xffff97887b5c]
/usr/local/bin/redis-server *:6379 [cluster](_start+0x30)[0x45afb0]

...

The assertion failure keeps repeating like this for a long time.

Additional information

  1. OS distribution and version
    Amazon Linux 2023, redis-version 7.2.2
  2. Steps to reproduce (if any)
  • Running a rebalance command on a cluster with a mixture of 7.0.11 and new 7.2.2 nodes:
redis-cli --cluster rebalance xxx:6379 --cluster-use-empty-masters --cluster-pipeline 1000 --cluster-weight d9a5864cf277e6f6cb21ea60a3cf0015ddf662a3=0

The rebalance command gets stuck, and on investigation this assertion was found.

@salarali salarali changed the title [CRASH] Assertion Failed when running rebalance command [CRASH] Assertion Failed when running rebalance command when upgrading from 7.0.11 to 7.2.2 Oct 27, 2023
@salarali
Author

Any updates on this?

@enjoy-binbin
Collaborator

@salarali thanks for the report, I am taking a look. Although it's a bit tortuous, I found a way to reproduce it.

@enjoy-binbin
Collaborator

@PingXie since you are here, can you also take a look?

This is somewhat like #12805: if the node is a master, we may need to add it to the shard list.

if (ext_shardid == NULL) clusterAddNodeToShard(sender->shard_id, sender);

The reason for the issue is: say we have A (7.2) -> B (7.0), where B is A's master.

In node A's view, A does not know B's shard id, so in here we are not able to clear B's slot_info:

void addShardReplyForClusterShards(client *c, list *nodes) {
    ...
    addReplyBulkCString(c, "nodes");
    addReplyArrayLen(c, listLength(nodes));
    listIter li;
    listRewind(nodes, &li);
    for (listNode *ln = listNext(&li); ln != NULL; ln = listNext(&li)) {
        clusterNode *n = listNodeValue(ln);
        addNodeDetailsToShardReply(c, n);
        clusterFreeNodesSlotsInfo(n);
    }
}

But in here, we will keep increasing B's slot info, and eventually hit the assert:

void clusterGenNodesSlotsInfo(int filter) {
    ...
        /* Generate slots info when occur different node with start
         * or end of slot. */
        if (i == CLUSTER_SLOTS || n != server.cluster->slots[i]) {
            if (!(n->flags & filter)) {
                if (!n->slot_info_pairs) {
                    n->slot_info_pairs = zmalloc(2 * n->numslots * sizeof(uint16_t));
                }
                serverAssert((n->slot_info_pairs_count + 1) < (2 * n->numslots));
                n->slot_info_pairs[n->slot_info_pairs_count++] = start;
                n->slot_info_pairs[n->slot_info_pairs_count++] = i-1;
            }
            if (i == CLUSTER_SLOTS) break;
            n = server.cluster->slots[i];
            start = i;
        }
}
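
For illustration, here is a hypothetical standalone sketch (plain C, not Redis source; the names simply mirror the fields above) of why repeated CLUSTER SHARDS calls on a node whose slot info is never freed eventually violate this assertion:

```c
/* Hypothetical standalone illustration (not Redis source): slot_info_pairs
 * is sized for a single pass over the slot map (2 * numslots entries), so if
 * it is never freed between CLUSTER SHARDS calls, the count keeps growing
 * until the condition checked by the serverAssert above no longer holds. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int numslots = 5462; /* e.g. one master owning slots 0..5461 */
    uint16_t *slot_info_pairs = malloc(2 * numslots * sizeof(uint16_t));
    int slot_info_pairs_count = 0;

    /* Each CLUSTER SHARDS call appends one (start, end) pair for the node's
     * single contiguous range, and nothing ever resets the count. */
    for (int call = 1; ; call++) {
        if (!((slot_info_pairs_count + 1) < (2 * numslots))) {
            printf("call %d: count=%d no longer fits %d entries -> "
                   "this is where Redis hits the serverAssert\n",
                   call, slot_info_pairs_count, 2 * numslots);
            break;
        }
        slot_info_pairs[slot_info_pairs_count++] = 0;            /* range start */
        slot_info_pairs[slot_info_pairs_count++] = numslots - 1; /* range end   */
    }
    free(slot_info_pairs);
    return 0;
}
```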

@PingXie
Contributor

PingXie commented Nov 27, 2023

@enjoy-binbin, it looks like your fix for #12805 might resolve this issue too. With that fix in place, every 7.2 node (like A in your example) should keep its shard view consistent. And if there's another replica in the mix, like A', in the same shard as A and B, it'll follow the same shard structure, though with a different ID. That's totally fine for a setup with different versions running. From what we talked about in #12805, once everyone's on 7.2, we should see the shard structures and IDs line up. That's when the shard IDs will really start to make sense.

Btw, is re-sharding required to trigger this bug? Generally, I'd lean towards updating all nodes to the same version before doing something as involved as re-sharding. Most folks update the whole cluster first, which is a good call – it keeps things straightforward and avoids the quirks you might run into with a mixed-version setup.

@enjoy-binbin
Collaborator

The fix in #12805 won't help, since we first check whether the sender's shard_id has changed; if it changed, we add it to the shard list. In this case, if the sender is a master, we won't change the shard_id, so we are not able to add it to the shard list.

@PingXie
Contributor

PingXie commented Nov 27, 2023

In node A's view, A does not know B's shard id, so in here we are not able to clear B's slot_info.

When a 7.2 node A replicates from a 7.0 (primary) node B, A should inherit B's shard id, even if it is randomly generated on node A. With the fix for #12805, I'd assume node B's shard ID remains stable on node A, hence these two will remain in the same shard. So the statement above does not match my understanding.

What are the repro steps? Or is it possible for you to share the core dump somehow? It is a bit hard to be certain just by looking at the source code.

@enjoy-binbin
Collaborator

enjoy-binbin commented Nov 27, 2023

My reproduction steps:

step A:
7.0 cluster
./utils/create-cluster/create-cluster stop && ./utils/create-cluster/create-cluster clean
./utils/create-cluster/create-cluster start && ./utils/create-cluster/create-cluster create

step B:
7.2 node
rm -rf nodes.conf && ./src/redis-server redis.conf --cluster-enabled yes --port 7000

step C:
./src/redis-cli -p 30001 cluster meet 127.0.0.1 7000
./src/redis-cli --cluster rebalance 127.0.0.1:30001 --cluster-use-empty-masters --cluster-pipeline 1000 --cluster-weight d1a8056b89c9790a6ad836320e7c6f7dfc9fd282=0

step D:
./src/redis-cli -p 7000 cluster shards

I repeat steps C and D, and CLUSTER SHARDS responds with this (we can see the slots section keeps expanding):

1) 1) "slots"            
   2)  1) (integer) 0       
       2) (integer) 5461
       3) (integer) 0                                  
       4) (integer) 5461
       5) (integer) 0       
       6) (integer) 5461
       7) (integer) 0   
       8) (integer) 5461         
       9) (integer) 0       
      10) (integer) 5461
      11) (integer) 0 
      12) (integer) 5461         
   3) "nodes"               
   4) 1)  1) "id"    
          2) "a447baa52a5d5564fbf27974e93e4f0825746d00"
          3) "port"                                    
          4) (integer) 30004
          5) "ip"          
          6) "127.0.0.1"
          7) "endpoint"                                
          8) "127.0.0.1"
          9) "role"        
         10) "replica"
         11) "replication-offset"
         12) (integer) 13566     
         13) "health"     
         14) "online"

The steps to reproduce are quite messy; I was upgrading nodes somewhat randomly locally.

Yeah, with the fix for #12805, the nodes will remain in the same shard. However, the shard id of a certain master has not been added to the shard id list. Maybe there is something missing somewhere.

/* The code here checks memcmp, and since the sender's shard id has not
 * changed, we won't add it to the shard id list. */
static void updateShardId(clusterNode *node, const char *shard_id) {
    if (shard_id && memcmp(node->shard_id, shard_id, CLUSTER_NAMELEN) != 0) {
        clusterRemoveNodeFromShard(node);
        memcpy(node->shard_id, shard_id, CLUSTER_NAMELEN);
        clusterAddNodeToShard(shard_id, node);
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);
    }
    if (shard_id && myself != node && myself->slaveof == node) {
        if (memcmp(myself->shard_id, shard_id, CLUSTER_NAMELEN) != 0) {
            /* shard-id can diverge right after a rolling upgrade
             * from pre-7.2 releases */
            clusterRemoveNodeFromShard(myself);
            memcpy(myself->shard_id, shard_id, CLUSTER_NAMELEN);
            clusterAddNodeToShard(shard_id, myself);
            clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
        }
    }
}

That’s why I plan to add this:

if (ext_shardid == NULL && is_master) clusterAddNodeToShard(sender->shard_id, sender);

So the essential reason is that #12805 only adds the replica, but not the master, to the shard id list (that is, server.cluster->shards), which causes this problem.

@PingXie
Contributor

PingXie commented Nov 27, 2023

Got it, this sounds more like a case where a 7.2 node is just observing a 7.0 shard, not actually replicating from it. Makes me wonder if we even need to go through re-sharding to reproduce this bug?

With the fix for #12805, the nodes will remain in the same shard. However, the shard id of a certain master has not been added to the shard id list. Maybe there is something missing somewhere.

Did you get a chance to try this out with your latest changes in #12805? The earlier commits had this issue, but I thought your last update would've fixed it for both v7.0 primary and replica nodes.

@enjoy-binbin
Collaborator

Makes me wonder if we even need to go through re-sharding to reproduce this bug?

I feel it's not needed. I just followed the issue's idea, and then I got an environment where it reproduces stably and didn't want to break it.

Did you get a chance to try this out with your latest changes in #12805? The earlier commits had this issue, but I thought your last update would've fixed it for both v7.0 primary and replica nodes.

I did try it (even earlier); it doesn't work. The new commit does call updateShardId, but the memcmp check prevents us from adding it to the shard id list.

@PingXie
Contributor

PingXie commented Nov 27, 2023

You are right. Now I see two potential issues with updateShardId:

  1. Guarding the shard dictionary update on the shard id change. These should've been two orthogonal decisions.
  2. The duplication of the shard dictionary update logic for both the incoming node n and myself.

Would you like to propose a change?
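
For reference, a minimal sketch of what decoupling those two decisions into one shared helper might look like (purely hypothetical, not the change that was eventually merged in #12832; it assumes clusterAddNodeToShard is a no-op for a node already registered in its shard's list, and it glosses over the SAVE/FSYNC flag differences in the current code):

```c
/* Hypothetical sketch only: register the node under its shard id
 * unconditionally instead of only when the id changes, and use one helper
 * for both the sender and myself. */
static void clusterSetNodeShardId(clusterNode *node, const char *shard_id) {
    int changed = memcmp(node->shard_id, shard_id, CLUSTER_NAMELEN) != 0;
    if (changed) {
        clusterRemoveNodeFromShard(node);
        memcpy(node->shard_id, shard_id, CLUSTER_NAMELEN);
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);
    }
    /* Orthogonal to the id change: make sure the shard dict knows about this
     * node, which covers a pre-7.2 primary whose (randomly generated) shard
     * id was never added to server.cluster->shards. Assumes the call is a
     * no-op if the node is already in the shard's list. */
    clusterAddNodeToShard(node->shard_id, node);
}
```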

@enjoy-binbin
Collaborator

I am happy to make the change and test it, but I'm a little confused. Do you have any ideas, or can you elaborate a bit more?

@PingXie
Contributor

PingXie commented Nov 29, 2023

Thinking about this more, a better and more correct fix would be to update the shard topology when a 7.0 replica is connected to its 7.0 primary for the very first time. More specifically, we need to inject an updateShardId(sender, master->shard_id) call at cluster_legacy.c:2947. This should fix the issue.

@enjoy-binbin
Collaborator

enjoy-binbin commented Nov 29, 2023

we need to inject an updateShardId(sender, master->shard_id) call

I tried it, it didn't work.

Luckily I found the minimal steps to reproduce:

# 7.0 cluster
./utils/create-cluster/create-cluster stop && ./utils/create-cluster/create-cluster clean
./utils/create-cluster/create-cluster start && ./utils/create-cluster/create-cluster create

# 7.2 node
rm -rf nodes.conf && ./src/redis-server redis.conf --cluster-enabled yes --port 6379

# 7.2 node replicate with a 7.0 node
./src/redis-cli -p 30001 cluster meet 127.0.0.1 6379
./src/redis-cli -p 6379 cluster replicate d808332a59dc44d4cf8cd0f54c2ea18e34ded7fa
./src/redis-cli -p 6379 cluster shards && ./src/redis-cli -p 6379 cluster shards

So the reason is that the 7.2 node is a 7.0 node's slave, and the shard id dict does not have the 7.0 node, so when we issue CLUSTER SHARDS on the 7.2 node, it goes here: #12695 (comment)

@PingXie
Contributor

PingXie commented Dec 4, 2023

So the reason is that the 7.2 node is a 7.0 node's slave, and the shard id dict does not have the 7.0 node, so when we issue CLUSTER SHARDS on the 7.2 node, it goes here: #12695 (comment)

This looks like a different issue than observed in your previous #12695 (comment). I think we should still keep the change I proposed in #12695 (comment).

We could continue with the fix proposed in #12695 (comment), but an alternative could be fixing the underlying assumption of updateShardId, which is that the shard dict should always be in sync with the node's shard_id. In this sense, I wonder if we should consider a fix that calls clusterAddNodeToShard at https://github.com/redis/redis/blob/unstable/src/cluster_legacy.c#L2161 and https://github.com/redis/redis/blob/unstable/src/cluster_legacy.c#L1615, when the node in question is a primary and its shard_id is not in the shard dict yet.
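
To make that concrete, the shape of such a check might look roughly like the sketch below (not the literal code at those two locations; shardIdIsInShardDict() is only a placeholder for "the shard_id already has an entry in server.cluster->shards" and is not an existing Redis function):

```c
/* Rough sketch of the idea only: when we learn that a node is a primary,
 * make sure its shard id is registered in server.cluster->shards before
 * CLUSTER SHARDS relies on that mapping. shardIdIsInShardDict() is a
 * placeholder, not a real Redis function. */
if ((node->flags & CLUSTER_NODE_MASTER) && !shardIdIsInShardDict(node->shard_id)) {
    clusterAddNodeToShard(node->shard_id, node);
}
```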

@enjoy-binbin
Collaborator

I wonder if we should consider a fix that calls clusterAddNodeToShard at https://github.com/redis/redis/blob/unstable/src/cluster_legacy.c#L2161 and https://github.com/redis/redis/blob/unstable/src/cluster_legacy.c#L1615, when the node in question is a primary and its shard_id is not in the shard dict yet.

Wow, that's actually what I thought at first: the shard dict should always be in sync with the node's shard_id. The first time I tried it, it didn't fix the issue, so I dropped it and considered fixing it with a smaller diff.

I actually agree with this idea. We don't update the shard dict synchronously in some places, which feels like a hidden danger.

I will try to open a new PR later to add all the changes we mentioned (for better review).

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this issue Dec 5, 2023
…e not sync

Crash reported in redis#12695. In the process of upgrading the cluster from
7.0 to 7.2, because the 7.0 nodes will not gossip shard id, in 7.2 we
will rely on shard id to build the server.cluster->shards dict.

In some cases, for example, the 7.0 master node and the 7.2 replica node.
From the view of 7.2 replica node, the cluster->shards dictionary does not
have its master node. In this case calling CLUSTER SHARDS on the 7.2 replica
node may crash.

A CLUSTER SHARDS result output:
```
1) 1) "slots"
   2)  1) (integer) 0
       2) (integer) 5461
       3) (integer) 0
       4) (integer) 5461
       5) (integer) 0
```

We can see that the output contains repeated slots, and each call will
append a new one, and then crash on serverAssert:
```c
void clusterGenNodesSlotsInfo(int filter) {
    ...
        /* Generate slots info when occur different node with start
         * or end of slot. */
        if (i == CLUSTER_SLOTS || n != server.cluster->slots[i]) {
            if (!(n->flags & filter)) {
                if (!n->slot_info_pairs) {
                    n->slot_info_pairs = zmalloc(2 * n->numslots * sizeof(uint16_t));
                }
                serverAssert((n->slot_info_pairs_count + 1) < (2 * n->numslots));
                n->slot_info_pairs[n->slot_info_pairs_count++] = start;
                n->slot_info_pairs[n->slot_info_pairs_count++] = i-1;
            }
            if (i == CLUSTER_SLOTS) break;
            n = server.cluster->slots[i];
            start = i;
        }
    ...
}
```

The reason is that in addShardReplyForClusterShards we are not able to
clean up the slot_info_pairs corresponding to the 7.0 master node. In the
code below, we will loop to find the 7.0 master node, and then we will call
clusterFreeNodesSlotsInfo to clean up slot_info_pairs according to the shard
id dict list, but the 7.0 master node is not in the list.
```c
void addShardReplyForClusterShards(client *c, list *nodes) {
    ...
    /* Use slot_info_pairs from the primary only */
    while (n->slaveof != NULL) n = n->slaveof;

    ...
    addReplyBulkCString(c, "nodes");
    addReplyArrayLen(c, listLength(nodes));
    listIter li;
    listRewind(nodes, &li);
    for (listNode *ln = listNext(&li); ln != NULL; ln = listNext(&li)) {
        clusterNode *n = listNodeValue(ln);
        addNodeDetailsToShardReply(c, n);
        clusterFreeNodesSlotsInfo(n);
    }
}
```

We should fix the underlying assumption of updateShardId, which is that the
shard dict should be always in sync with the node's shard_id. The fix was
suggested by PingXie, see more details in redis#12695.

Co-authored-by: Ping Xie <pingxie@google.com>
@enjoy-binbin
Collaborator

@PingXie thanks! I verified that adding it to clusterRenameNode can fix this issue: #12695 (comment). Please take a look at the fix: #12832.

madolson pushed a commit that referenced this issue Jan 8, 2024
…e not sync (#12832)

Crash reported in #12695. In the process of upgrading the cluster from
7.0 to 7.2, because the 7.0 nodes will not gossip shard id, in 7.2 we
will rely on shard id to build the server.cluster->shards dict.

In some cases, for example, the 7.0 master node and the 7.2 replica node.
From the view of 7.2 replica node, the cluster->shards dictionary does not
have its master node. In this case calling CLUSTER SHARDS on the 7.2 replica
node may crash.

We should fix the underlying assumption of updateShardId, which is that the
shard dict should be always in sync with the node's shard_id. The fix was
suggested by PingXie, see more details in #12695.
oranagra pushed a commit to oranagra/redis that referenced this issue Jan 9, 2024
…e not sync (redis#12832) (cherry picked from commit 5b0c6a8)
oranagra pushed a commit that referenced this issue Jan 9, 2024
…e not sync (#12832) (cherry picked from commit 5b0c6a8)
roggervalf pushed a commit to roggervalf/redis that referenced this issue Feb 11, 2024
…e not sync (redis#12832)