Merge 'Fix bootstrap "wait for UP/NORMAL nodes" to handle ignored nodes, recently replaced nodes, and recently changed IPs' from Kamil Braun

Before this PR, `wait_for_normal_state_handled_on_boot` would wait for a static set of nodes (`sync_nodes`), calculated using the `get_nodes_to_sync_with` function and `parse_node_list`; the latter was used to obtain a list of "nodes to ignore" (for the replace operation) and translate them, using `token_metadata`, from IP addresses to Host IDs and vice versa. `sync_nodes` was also used in the `_gossiper.wait_alive` call which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in the boot procedure -- immediately after we start gossiping (50e8ec7). Unfortunately, as always with the gossiper, there are complications. In #14468 and #14487 two problems were detected:

- The gossiper may contain obsolete entries for nodes which were recently replaced or changed their IPs. These entries still use status `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g. `handle_state_normal` is also called for it). The `_gossiper.wait_alive` call would wait for those entries too and eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata` may not be populated yet, which is required to do the IP <-> Host ID translations -- and populating `token_metadata` happens inside `handle_state_normal`, so we have a chicken-and-egg problem here.

It turns out that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`) in order to wait for NORMAL state handlers. We can wait for the handlers to finish for *any* `NORMAL`/`shutdown` entries appearing in the gossiper, even those that correspond to dead/ignored nodes and obsolete IPs. `handle_state_normal` is called, and eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of nodes as a parameter and is modified appropriately: it now calculates the necessary set of nodes on each retry (the set may shrink while we're waiting, e.g. because an entry corresponding to a replaced node is garbage-collected from gossiper state). Thanks to this, we can now put the `sync_nodes` calculation (which is still necessary for `_gossiper.wait_alive`), and hence the `parse_node_list` call, *after* we wait for the NORMAL state handlers, solving the chicken-and-egg problem.

This addresses the immediate failure described in #14487, but the test would still fail, because `_gossiper.wait_alive` may still receive too large a set of nodes: we may still include obsolete IPs or entries corresponding to replaced nodes in the `sync_nodes` set. We need a better way to calculate `sync_nodes`, one which detects and ignores obsolete IPs and nodes that are already gone but just weren't garbage-collected from gossiper state yet. In fact such a method was already introduced in the past (ca61d88), but it wasn't used everywhere. There, we use `token_metadata`, in which collisions between Host IDs and tokens are resolved, so it contains only entries that correspond to the "real" current set of NORMAL nodes. We now use this method to calculate the set of nodes passed to `_gossiper.wait_alive`.

We also introduce regression tests, with necessary extensions to the test framework.
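The per-retry recalculation described above can be sketched as follows. This is a minimal Python illustration, not the actual Scylla C++ implementation; `get_states` and `get_handled` are hypothetical callbacks standing in for the gossiper's endpoint state map and the set of endpoints whose `handle_state_normal` has completed:

```python
import asyncio

async def wait_for_normal_state_handled(get_states, get_handled,
                                        retry_interval=0.1, deadline=5.0):
    """Wait until the normal-state handler has finished for every endpoint
    currently in NORMAL or shutdown status. The candidate set is recomputed
    on every retry, so entries that are garbage-collected from gossiper
    state stop being waited for."""
    end = asyncio.get_running_loop().time() + deadline
    while True:
        # Recompute the set on each iteration: it may shrink while we wait,
        # e.g. when an entry for a replaced node is garbage-collected.
        pending = {ep for ep, status in get_states().items()
                   if status in ('NORMAL', 'shutdown') and ep not in get_handled()}
        if not pending:
            return
        if asyncio.get_running_loop().time() > end:
            raise TimeoutError(f"normal state handlers not finished for {pending}")
        await asyncio.sleep(retry_interval)
```

Note that the loop terminates both when a pending entry gets handled and when it disappears from gossiper state entirely, which is exactly why no up-front `sync_nodes` set is needed.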
Fixes #14468
Fixes #14487
Closes #14507

* github.com:scylladb/scylladb:
  test: rename `test_topology_ip.py` to `test_replace.py`
  test: test bootstrap after IP change
  test: scylla_cluster: return the new IP from `change_ip` API
  test: node replace with `ignore_dead_nodes` test
  test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
  storage_service: remove `get_nodes_to_sync_with`
  storage_service: use `token_metadata` to calculate nodes waited for to be UP
  storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
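The "use `token_metadata` to calculate nodes waited for to be UP" commit can be illustrated with a toy model. This is an assumed sketch, not the real Scylla API: `token_metadata` is modeled as a Host ID to IP mapping that already excludes replaced nodes and obsolete IPs, since collisions on Host IDs and tokens are resolved there:

```python
def nodes_to_wait_alive(token_metadata, ignore_nodes, my_host_id):
    """Toy model of deriving the wait-for-UP set.

    token_metadata: dict mapping Host ID -> current IP, containing only the
    resolved, "real" current set of NORMAL nodes (unlike raw gossiper state,
    which may still hold obsolete entries).
    ignore_nodes: Host IDs to skip, e.g. dead nodes ignored during replace.
    Returns the sorted list of IPs to wait for as UP."""
    return sorted(ip for host_id, ip in token_metadata.items()
                  if host_id not in ignore_nodes and host_id != my_host_id)
```

Because the obsolete IP of a node that changed addresses simply never appears in `token_metadata`, the wait can no longer time out on it.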
Showing 8 changed files with 187 additions and 55 deletions.
File renamed without changes (`test_topology_ip.py` renamed to `test_replace.py`).
New file (52 lines added):

```python
#
# Copyright (C) 2023-present ScyllaDB
#
# SPDX-License-Identifier: AGPL-3.0-or-later
#
import time
import pytest
import logging

from test.pylib.internal_types import IPAddress, HostID
from test.pylib.scylla_cluster import ReplaceConfig
from test.pylib.manager_client import ManagerClient
from test.topology.util import wait_for_token_ring_and_group0_consistency


logger = logging.getLogger(__name__)


@pytest.mark.asyncio
async def test_boot_after_ip_change(manager: ManagerClient) -> None:
    """Bootstrap a new node after existing one changed its IP.
       Regression test for #14468. Does not apply to Raft-topology mode.
    """
    cfg = {'experimental_features': list[str]()}
    logger.info("Booting initial cluster")
    servers = [await manager.server_add(config=cfg) for _ in range(2)]
    await wait_for_token_ring_and_group0_consistency(manager, time.time() + 30)

    logger.info(f"Stopping server {servers[1]}")
    await manager.server_stop_gracefully(servers[1].server_id)

    logger.info(f"Changing IP of server {servers[1]}")
    new_ip = await manager.server_change_ip(servers[1].server_id)
    servers[1] = servers[1]._replace(ip_addr = new_ip)
    logger.info(f"New IP: {new_ip}")

    logger.info(f"Restarting server {servers[1]}")
    await manager.server_start(servers[1].server_id)

    # We need to do this wait before we boot a new node.
    # Otherwise the newly booting node may contact servers[0] even before servers[0]
    # saw the new IP of servers[1], and then the booting node will try to wait
    # for servers[1] to be alive using its old IP (and eventually time out).
    #
    # Note that this still acts as a regression test for #14468.
    # In #14468, the problem was that a booting node would try to wait for the old IP
    # of servers[0] even after all existing servers saw the IP change.
    logger.info(f"Wait until {servers[0]} sees the new IP of {servers[1]}")
    await manager.server_sees_other_server(servers[0].ip_addr, servers[1].ip_addr)

    logger.info("Booting new node")
    await manager.server_add(config=cfg)
```
New file (56 lines added):

```python
#
# Copyright (C) 2023-present ScyllaDB
#
# SPDX-License-Identifier: AGPL-3.0-or-later
#
import time
import pytest
import logging

from test.pylib.internal_types import IPAddress, HostID
from test.pylib.scylla_cluster import ReplaceConfig
from test.pylib.manager_client import ManagerClient
from test.topology.util import wait_for_token_ring_and_group0_consistency


logger = logging.getLogger(__name__)


@pytest.mark.asyncio
async def test_replace_ignore_nodes(manager: ManagerClient) -> None:
    """Replace a node in the presence of multiple dead nodes.
       Regression test for #14487. Does not apply to Raft-topology mode.

       This is a slow test with a 7 node cluster and 3 replace operations;
       we don't want to run it in debug mode.
       Preferably run it only in one mode, e.g. dev.
    """
    cfg = {'experimental_features': list[str]()}
    logger.info("Booting initial cluster")
    servers = [await manager.server_add(config=cfg) for _ in range(7)]
    s2_id = await manager.get_host_id(servers[2].server_id)
    logger.info(f"Stopping servers {servers[:3]}")
    await manager.server_stop(servers[0].server_id)
    await manager.server_stop(servers[1].server_id)
    await manager.server_stop_gracefully(servers[2].server_id)

    # The parameter accepts both IP addresses and host IDs.
    # We must be able to resolve them in both ways.
    ignore_dead: list[IPAddress | HostID] = [servers[1].ip_addr, s2_id]
    logger.info(f"Replacing {servers[0]}, ignore_dead_nodes = {ignore_dead}")
    replace_cfg = ReplaceConfig(replaced_id = servers[0].server_id, reuse_ip_addr = False, use_host_id = False,
                                ignore_dead_nodes = ignore_dead)
    await manager.server_add(replace_cfg=replace_cfg, config=cfg)
    await wait_for_token_ring_and_group0_consistency(manager, time.time() + 30)

    ignore_dead = [servers[2].ip_addr]
    logger.info(f"Replacing {servers[1]}, ignore_dead_nodes = {ignore_dead}")
    replace_cfg = ReplaceConfig(replaced_id = servers[1].server_id, reuse_ip_addr = False, use_host_id = False,
                                ignore_dead_nodes = ignore_dead)
    await manager.server_add(replace_cfg=replace_cfg, config=cfg)
    await wait_for_token_ring_and_group0_consistency(manager, time.time() + 30)

    logger.info(f"Replacing {servers[2]}")
    replace_cfg = ReplaceConfig(replaced_id = servers[2].server_id, reuse_ip_addr = False, use_host_id = False)
    await manager.server_add(replace_cfg=replace_cfg, config=cfg)
    await wait_for_token_ring_and_group0_consistency(manager, time.time() + 30)
```
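As the test's comment notes, `ignore_dead_nodes` accepts a mix of IP addresses and host IDs, and each form must be resolvable to the other. A hypothetical sketch of such a translation (names are assumed for illustration; in Scylla the real work happens in `parse_node_list`, which relies on a populated `token_metadata`, which is why the PR moves that call after the normal-state-handler wait):

```python
def resolve_ignored_nodes(entries, ip_to_host_id):
    """Translate a mixed list of IP addresses and host IDs into
    (host_id, ip) pairs, given an ip -> host-id mapping. If the mapping
    is not populated yet, unresolvable entries raise an error, which is
    the chicken-and-egg problem described in the commit message."""
    host_id_to_ip = {hid: ip for ip, hid in ip_to_host_id.items()}
    resolved = []
    for entry in entries:
        if entry in ip_to_host_id:        # entry given as an IP address
            resolved.append((ip_to_host_id[entry], entry))
        elif entry in host_id_to_ip:      # entry given as a host ID
            resolved.append((entry, host_id_to_ip[entry]))
        else:
            raise ValueError(f"cannot resolve node {entry!r}")
    return resolved
```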