test_ignore_dead_nodes_for_replace_option: Startup failed: std::runtime_error (Failed to parse node list: {127.0.55.6, 127.0.55.7}: invalid node=127.0.55.6: std::runtime_error (Host inet address 127.0.55.6 not found in the cluster)) #14487

Closed · bhalevy opened this issue Jul 3, 2023 · 9 comments · Fixed by #14507
Labels: P1 Urgent, status/regression, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

bhalevy (Member) commented Jul 3, 2023

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/294/artifact/logs-full.release.019/1688369920023_repair_based_node_operations_test.py%3A%3ATestRepairBasedNodeOperations%3A%3Atest_ignore_dead_nodes_for_replace_option/node8.log

Scylla version 5.4.0~dev-0.20230703.1ab2bb69b8a6 with build-id 1054adfa55f238441c3044e6213fd02d31d43279 starting ...
command used: "/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/bin/scylla --options-file /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/conf/scylla.yaml --log-to-stdout 1 --abort-on-seastar-bad-alloc --abort-on-lsa-bad-alloc 1 --abort-on-internal-error 1 --enable-repair-based-node-ops true --ignore-dead-nodes-for-replace 127.0.55.6,127.0.55.7 --api-address 127.0.55.8 --collectd-hostname d332c648e72c.node8 --smp 2 --memory 1024M --developer-mode true --default-log-level info --collectd 0 --overprovisioned --prometheus-address 127.0.55.8 --replace-node-first-boot d97b63b2-72f5-4640-bba8-893420adf609 --unsafe-bypass-fsync 1 --kernel-page-cache 1 --commitlog-use-o-dsync 0 --max-networking-io-control-blocks 1000"
parsed command line options: [options-file: /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/conf/scylla.yaml, log-to-stdout, (positional) 1, abort-on-seastar-bad-alloc, abort-on-lsa-bad-alloc: 1, abort-on-internal-error: 1, enable-repair-based-node-ops: true, ignore-dead-nodes-for-replace: 127.0.55.6,127.0.55.7, api-address: 127.0.55.8, collectd-hostname, (positional) d332c648e72c.node8, smp, (positional) 2, memory, (positional) 1024M, developer-mode: true, default-log-level, (positional) info, collectd, (positional) 0, overprovisioned, prometheus-address: 127.0.55.8, replace-node-first-boot: d97b63b2-72f5-4640-bba8-893420adf609, unsafe-bypass-fsync, (positional) 1, kernel-page-cache, (positional) 1, commitlog-use-o-dsync: 0, max-networking-io-control-blocks, (positional) 1000]

INFO  2023-07-03 07:38:26,583 [shard 0] init - seeds={127.0.55.1, 127.0.55.2, 127.0.55.3, 127.0.55.4, 127.0.55.5, 127.0.55.6, 127.0.55.7}, listen_address=127.0.55.8, broadcast_address=127.0.55.8

INFO  2023-07-03 07:38:26,662 [shard 0] storage_service - entering STARTING mode
INFO  2023-07-03 07:38:26,662 [shard 0] storage_service - Loading persisted ring state
INFO  2023-07-03 07:38:26,664 [shard 0] storage_service - initial_contact_nodes={127.0.55.7, 127.0.55.6, 127.0.55.5, 127.0.55.4, 127.0.55.3, 127.0.55.2, 127.0.55.1}, loaded_endpoints={}, loaded_peer_features=0
INFO  2023-07-03 07:38:26,664 [shard 0] storage_service - Gathering node replacement information for d97b63b2-72f5-4640-bba8-893420adf609/0000:0000:0000:0000:0000:0000:0000:0000
INFO  2023-07-03 07:38:26,664 [shard 0] storage_service - Checking remote features with gossip
INFO  2023-07-03 07:38:26,664 [shard 0] gossip - Gossip shadow round started with nodes={127.0.55.7, 127.0.55.6, 127.0.55.5, 127.0.55.4, 127.0.55.3, 127.0.55.2, 127.0.55.1}
WARN  2023-07-03 07:38:26,665 [shard 0] gossip - Node 127.0.55.7 is down for get_endpoint_states verb
WARN  2023-07-03 07:38:26,665 [shard 0] gossip - Node 127.0.55.6 is down for get_endpoint_states verb
WARN  2023-07-03 07:38:26,665 [shard 0] gossip - Node 127.0.55.5 is down for get_endpoint_states verb
INFO  2023-07-03 07:38:26,666 [shard 0] gossip - Gossip shadow round finished with nodes_talked={127.0.55.2, 127.0.55.1, 127.0.55.3, 127.0.55.4}
INFO  2023-07-03 07:38:26,666 [shard 0] gossip - Feature check passed. Local node 127.0.55.8 features = {AGGREGATE_STORAGE_OPTIONS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COLLECTION_INDEXING, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, EMPTY_REPLICA_PAGES, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_COLLECTION_DETECTION, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_COMMITLOG, SCHEMA_TABLES_V3, SECONDARY_INDEXES_ON_STATIC_COLUMNS, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, TYPED_ERRORS_IN_READ_RPC, UDA, UDA_NATIVE_PARALLELIZED_AGGREGATION, UNBOUNDED_RANGE_TOMBSTONES, UUID_SSTABLE_IDENTIFIERS, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {AGGREGATE_STORAGE_OPTIONS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COLLECTION_INDEXING, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, EMPTY_REPLICA_PAGES, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_COLLECTION_DETECTION, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_COMMITLOG, SCHEMA_TABLES_V3, SECONDARY_INDEXES_ON_STATIC_COLUMNS, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, TYPED_ERRORS_IN_READ_RPC, UDA, UDA_NATIVE_PARALLELIZED_AGGREGATION, UNBOUNDED_RANGE_TOMBSTONES, UUID_SSTABLE_IDENTIFIERS, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}
INFO  2023-07-03 07:38:26,666 [shard 0] storage_service - Host 189dd9bb-d950-446c-b527-a544925bc828/127.0.55.8 is replacing d97b63b2-72f5-4640-bba8-893420adf609/127.0.55.5
INFO  2023-07-03 07:38:26,666 [shard 0] storage_service - Replacing a node with a different IP address, my address=127.0.55.8, node being replaced=127.0.55.5
INFO  2023-07-03 07:38:26,666 [shard 0] storage_service - Save advertised features list in the 'system.local' table
INFO  2023-07-03 07:38:26,669 [shard 0] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-07-03 07:38:26,669 [shard 0] storage_service - Starting up server gossip
INFO  2023-07-03 07:38:26,676 [shard 0] compaction - [Compact system.local 98dfab40-1974-11ee-8f2d-ec009dbc2650] Compacting [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-2-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-4-big-Data.db:level=0:origin=memtable]
INFO  2023-07-03 07:38:26,677 [shard 0] gossip - failure_detector_loop: Started main loop
INFO  2023-07-03 07:38:26,677 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-07-03 07:38:26,686 [shard 0] compaction - [Compact system.local 98dfab40-1974-11ee-8f2d-ec009dbc2650] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-6-big-Data.db:level=0]. 13kB to 7kB (~55% of original) in 6ms = 1MB/s. ~256 total partitions merged to 1.
INFO  2023-07-03 07:38:27,988 [shard 0] storage_service - Set host_id=9137e227-94f4-449e-a6c6-dceed01cff99 to be owned by node=127.0.55.4
WARN  2023-07-03 07:38:27,990 [shard 0] gossip - Fail to send EchoMessage to 127.0.55.5: seastar::rpc::closed_error (connection is closed)
WARN  2023-07-03 07:38:27,990 [shard 0] gossip - Fail to send EchoMessage to 127.0.55.6: seastar::rpc::closed_error (connection is closed)
WARN  2023-07-03 07:38:27,990 [shard 0] gossip - Fail to send EchoMessage to 127.0.55.7: seastar::rpc::closed_error (connection is closed)
INFO  2023-07-03 07:38:27,990 [shard 0] gossip - InetAddress 127.0.55.4 is now UP, status = NORMAL
INFO  2023-07-03 07:38:27,990 [shard 0] gossip - InetAddress 127.0.55.3 is now UP, status = NORMAL
INFO  2023-07-03 07:38:27,990 [shard 0] gossip - InetAddress 127.0.55.1 is now UP, status = NORMAL
INFO  2023-07-03 07:38:27,990 [shard 0] gossip - InetAddress 127.0.55.2 is now UP, status = NORMAL
INFO  2023-07-03 07:38:27,991 [shard 0] storage_service - Set host_id=c6089c7e-809f-447e-a2c8-91009de372e2 to be owned by node=127.0.55.3
INFO  2023-07-03 07:38:27,993 [shard 0] storage_service - Set host_id=66e3c3ad-d16d-469c-98e2-6c3ff5f8aaf8 to be owned by node=127.0.55.1
INFO  2023-07-03 07:38:27,994 [shard 0] gossip - Live nodes seen in gossip: {127.0.55.1, 127.0.55.2, 127.0.55.3, 127.0.55.4, 127.0.55.8}

==> [BH] it looks like the exception happened here

INFO  2023-07-03 07:38:27,995 [shard 0] init - Shutting down group 0 service
...
INFO  2023-07-03 07:38:27,997 [shard 0] storage_service - Set host_id=ec4c0157-1523-4a1d-8aeb-e56e685e507a to be owned by node=127.0.55.6

==> [BH] handle_state_normal happens in parallel to storage_service::join_token_ring

...
INFO  2023-07-03 07:38:28,026 [shard 0] gossip - InetAddress 127.0.55.6 is now DOWN, status = shutdown
...
ERROR 2023-07-03 07:38:36,614 [shard 0] init - Startup failed: std::runtime_error (Failed to parse node list: {127.0.55.6, 127.0.55.7}: invalid node=127.0.55.6: std::runtime_error (Host inet address 127.0.55.6 not found in the cluster))

This looks like another fallout from 50e8ec7.

I think that in

co_await _gossiper.wait_for_live_nodes_to_show_up(2);
auto ignore_nodes = ri
    ? parse_node_list(_db.local().get_config().ignore_dead_nodes_for_replace(), get_token_metadata())
    // TODO: specify ignore_nodes for bootstrap
    : std::unordered_set<gms::inet_address>{};

wait_for_live_nodes_to_show_up(2) doesn't ensure that the nodes in the ignore list (which are given by IP address in this test) have their host_id set up by

slogger.info("Set host_id={} to be owned by node={}", host_id, endpoint);
tmptr->update_host_id(host_id, endpoint);
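A minimal sketch of why that ordering matters (editor's illustration; `parse_node_list_model` and its plain-string types are assumptions, not the actual ScyllaDB implementation): translating an `--ignore-dead-nodes-for-replace` entry given as an IP requires the IP-to-host-ID mapping that only `handle_state_normal` / `update_host_id` populates, so doing the translation before those handlers have run fails with exactly the error from the log.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Editor's model of the IP -> host ID translation done for entries of
// --ignore-dead-nodes-for-replace; names and types are simplified assumptions.
std::set<std::string> parse_node_list_model(
        const std::set<std::string>& nodes,                        // e.g. {"127.0.55.6", "127.0.55.7"}
        const std::map<std::string, std::string>& ip_to_host_id) { // populated by handle_state_normal
    std::set<std::string> host_ids;
    for (const auto& ip : nodes) {
        auto it = ip_to_host_id.find(ip);
        if (it == ip_to_host_id.end()) {
            // Corresponds to the observed "Host inet address ... not found in the cluster".
            throw std::runtime_error("Host inet address " + ip + " not found in the cluster");
        }
        host_ids.insert(it->second);
    }
    return host_ids;
}
```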

bhalevy added the status/regression, P1 Urgent, symptom/ci stability, and triage/master (Looking for assignee) labels on Jul 3, 2023
DoronArazii commented:

@kbr-scylla please have a look

DoronArazii added this to the 5.3 milestone on Jul 3, 2023
bhalevy (Member, Author) commented Jul 3, 2023

@DoronArazii this issue doesn't exist in 5.3 since 50e8ec7 is confined to master (5.4.dev)

bhalevy modified the milestones: 5.3 → 5.4 on Jul 3, 2023
kbr-scylla (Contributor) commented:

> wait_for_live_nodes_to_show_up(2) doesn't ensure that the nodes in the ignore list (which are given by IP address in this test) have their host_id set up

How was that ensured before in this test? By ring_delay sleep?

kbr-scylla (Contributor) commented:

We've got a bit of a chicken-and-egg problem here:

  • we need to wait for NORMAL state handlers for nodes from set X
  • to calculate set X, we need to calculate ignore_nodes
  • but to calculate ignore_nodes, we need to wait until these nodes' NORMAL state handlers have finished (at least partially - up to the step that sets host_ids)...

Damn it, why did I think that trying to improve gossiper boot code is a good idea.
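To make the cycle concrete, here is a tiny standalone illustration (editor's sketch, not ScyllaDB code): if the ignore list is translated while the IP-to-host-ID view is still empty, i.e. before the relevant NORMAL state handlers have run, the lookup necessarily fails.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

int main() {
    // token_metadata's IP -> host ID view before any handle_state_normal ran: empty.
    std::map<std::string, std::string> ip_to_host_id;
    // The dead nodes passed via --ignore-dead-nodes-for-replace in this test.
    std::set<std::string> ignore_list = {"127.0.55.6", "127.0.55.7"};

    bool translation_possible = true;
    for (const auto& ip : ignore_list) {
        if (ip_to_host_id.find(ip) == ip_to_host_id.end()) {
            translation_possible = false; // the real code throws std::runtime_error here
        }
    }
    assert(!translation_possible); // hence the chicken-and-egg: handlers must run first
    return 0;
}
```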

kbr-scylla (Contributor) commented:

    , ignore_dead_nodes_for_replace(this, "ignore_dead_nodes_for_replace", value_status::Used, "", "List dead nodes to ignore for replace operation using a comma-separated list of host IDs. E.g., scylla --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e")

it's interesting that the implementation actually supports passing IPs

bhalevy (Member, Author) commented Jul 3, 2023

    , ignore_dead_nodes_for_replace(this, "ignore_dead_nodes_for_replace", value_status::Used, "", "List dead nodes to ignore for replace operation using a comma-separated list of host IDs. E.g., scylla --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e")

> it's interesting that the implementation actually supports passing IPs

That's for backward compatibility.
We can decide to deprecate that and require passing host_ids.

kbr-scylla (Contributor) commented:

However, deprecation requires a deprecation period, so I still need to solve the problem.

I do have an idea, but I need to verify it.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
storage_service: don't calculate `ignore_nodes` before waiting for normal handlers

Before this commit the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec7).

Unfortunately, as always with gossiper, there are complications.
In scylladb#14468 and scylladb#14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

The `parse_node_list` problem is solved in this commit. It turns out
that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`)
in order to wait for NORMAL state handlers. We can wait for handlers to
finish for *any* `NORMAL`/`shutdown` entries appearing in gossiper, even
those that correspond to dead/ignored nodes and obsolete IPs.
`handle_state_normal` is called, and eventually finishes, for all of
them. `wait_for_normal_state_handled_on_boot` no longer receives a set
of nodes as parameter and is modified appropriately, it's now
calculating the necessary set of nodes on each retry (the set may shrink
while we're waiting, e.g. because an entry corresponding to a node that
was replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in scylladb#14487, but the test
will still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set. We fix this
in the following commit which will solve both issues.
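A rough sketch of the waiting scheme described above (editor's model under stated assumptions; the real `wait_for_normal_state_handled_on_boot` is a Seastar coroutine, not a thread-sleeping loop): the set of still-unhandled `NORMAL`/`shutdown` entries is recomputed on every retry instead of being fixed up front, so it can shrink while we wait.

```cpp
#include <chrono>
#include <functional>
#include <set>
#include <stdexcept>
#include <string>
#include <thread>

// Editor's model: wait until every gossiper entry in NORMAL/shutdown status has had
// its handle_state_normal completed, recomputing the pending set on each retry.
void wait_for_normal_state_handled_on_boot_model(
        std::function<std::set<std::string>()> unhandled_normal_entries, // recomputed every iteration
        std::chrono::seconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const auto pending = unhandled_normal_entries();
        if (pending.empty()) {
            return; // all NORMAL/shutdown entries handled; safe to compute ignore_nodes now
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            throw std::runtime_error("timed out waiting for NORMAL state handlers");
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}
```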
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
storage_service: use `token_metadata` to calculate nodes waited for to be UP

At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.

In scylladb#14468 and scylladb#14487 we observed that this set may include entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

Fixes scylladb#14468
Fixes scylladb#14487
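A small sketch of the `sync_nodes` calculation this commit describes (editor's model; `compute_sync_nodes_model` and the string-keyed sets are assumptions): take the current NORMAL members as seen by `token_metadata` (collisions already resolved, so no replaced nodes or obsolete IPs) and drop the nodes listed in `--ignore-dead-nodes-for-replace`; the result is what gets passed to `_gossiper.wait_alive`.

```cpp
#include <set>
#include <string>

// Editor's model: derive the nodes to wait for (UP) from token_metadata's view of
// the current NORMAL members, minus the nodes we were told to ignore for replace.
std::set<std::string> compute_sync_nodes_model(
        const std::set<std::string>& token_metadata_normal_members,
        const std::set<std::string>& ignore_nodes) {
    std::set<std::string> sync_nodes;
    for (const auto& node : token_metadata_normal_members) {
        if (ignore_nodes.find(node) == ignore_nodes.end()) {
            sync_nodes.insert(node); // these go to _gossiper.wait_alive(...)
        }
    }
    return sync_nodes;
}
```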
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
Regression test for scylladb#14487 on steroids. It performs 3 consecutive node
replace operations, starting with 3 dead nodes.

In order to have a Raft majority, we have to boot a 7-node cluster, so
we enable this test only in one mode; the choice was between `dev` and
`release`, I picked `dev` because it compiles faster and I develop on
it.
DoronArazii removed the triage/master (Looking for assignee) label on Jul 9, 2023
tgrabiec added a commit that referenced this issue Jul 9, 2023
…es, recently replaced nodes, and recently changed IPs' from Kamil Braun

Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec7).

Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs.  `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as parameter and is modified appropriately, it's now calculating
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

We also introduce regression tests with necessary extensions
to the test framework.

Fixes #14468
Fixes #14487

Closes #14507

* github.com:scylladb/scylladb:
  test: rename `test_topology_ip.py` to `test_replace.py`
  test: test bootstrap after IP change
  test: scylla_cluster: return the new IP from `change_ip` API
  test: node replace with `ignore_dead_nodes` test
  test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
  storage_service: remove `get_nodes_to_sync_with`
  storage_service: use `token_metadata` to calculate nodes waited for to be UP
  storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
avikivity (Member) commented:

@bhalevy guessing this was a transient regression and doesn't need a backport

bhalevy (Member, Author) commented Dec 3, 2023

> @bhalevy guessing this was a transient regression and doesn't need a backport

Correct. Both the patch that caused the regression and the fix are in 5.4 now.
