test_ignore_dead_nodes_for_replace_option: Startup failed: std::runtime_error (Failed to parse node list: {127.0.55.6, 127.0.55.7}: invalid node=127.0.55.6: std::runtime_error (Host inet address 127.0.55.6 not found in the cluster)) #14487

Closed · bhalevy opened this issue Jul 3, 2023 · 9 comments · Fixed by #14507
Labels: P1 Urgent, status/regression, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

bhalevy (Member) commented Jul 3, 2023

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/294/artifact/logs-full.release.019/1688369920023_repair_based_node_operations_test.py%3A%3ATestRepairBasedNodeOperations%3A%3Atest_ignore_dead_nodes_for_replace_option/node8.log

Scylla version 5.4.0~dev-0.20230703.1ab2bb69b8a6 with build-id 1054adfa55f238441c3044e6213fd02d31d43279 starting ...
command used: "/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/bin/scylla --options-file /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/conf/scylla.yaml --log-to-stdout 1 --abort-on-seastar-bad-alloc --abort-on-lsa-bad-alloc 1 --abort-on-internal-error 1 --enable-repair-based-node-ops true --ignore-dead-nodes-for-replace 127.0.55.6,127.0.55.7 --api-address 127.0.55.8 --collectd-hostname d332c648e72c.node8 --smp 2 --memory 1024M --developer-mode true --default-log-level info --collectd 0 --overprovisioned --prometheus-address 127.0.55.8 --replace-node-first-boot d97b63b2-72f5-4640-bba8-893420adf609 --unsafe-bypass-fsync 1 --kernel-page-cache 1 --commitlog-use-o-dsync 0 --max-networking-io-control-blocks 1000"
parsed command line options: [options-file: /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/conf/scylla.yaml, log-to-stdout, (positional) 1, abort-on-seastar-bad-alloc, abort-on-lsa-bad-alloc: 1, abort-on-internal-error: 1, enable-repair-based-node-ops: true, ignore-dead-nodes-for-replace: 127.0.55.6,127.0.55.7, api-address: 127.0.55.8, collectd-hostname, (positional) d332c648e72c.node8, smp, (positional) 2, memory, (positional) 1024M, developer-mode: true, default-log-level, (positional) info, collectd, (positional) 0, overprovisioned, prometheus-address: 127.0.55.8, replace-node-first-boot: d97b63b2-72f5-4640-bba8-893420adf609, unsafe-bypass-fsync, (positional) 1, kernel-page-cache, (positional) 1, commitlog-use-o-dsync: 0, max-networking-io-control-blocks, (positional) 1000]

INFO  2023-07-03 07:38:26,583 [shard 0] init - seeds={127.0.55.1, 127.0.55.2, 127.0.55.3, 127.0.55.4, 127.0.55.5, 127.0.55.6, 127.0.55.7}, listen_address=127.0.55.8, broadcast_address=127.0.55.8

INFO  2023-07-03 07:38:26,662 [shard 0] storage_service - entering STARTING mode
INFO  2023-07-03 07:38:26,662 [shard 0] storage_service - Loading persisted ring state
INFO  2023-07-03 07:38:26,664 [shard 0] storage_service - initial_contact_nodes={127.0.55.7, 127.0.55.6, 127.0.55.5, 127.0.55.4, 127.0.55.3, 127.0.55.2, 127.0.55.1}, loaded_endpoints={}, loaded_peer_features=0
INFO  2023-07-03 07:38:26,664 [shard 0] storage_service - Gathering node replacement information for d97b63b2-72f5-4640-bba8-893420adf609/0000:0000:0000:0000:0000:0000:0000:0000
INFO  2023-07-03 07:38:26,664 [shard 0] storage_service - Checking remote features with gossip
INFO  2023-07-03 07:38:26,664 [shard 0] gossip - Gossip shadow round started with nodes={127.0.55.7, 127.0.55.6, 127.0.55.5, 127.0.55.4, 127.0.55.3, 127.0.55.2, 127.0.55.1}
WARN  2023-07-03 07:38:26,665 [shard 0] gossip - Node 127.0.55.7 is down for get_endpoint_states verb
WARN  2023-07-03 07:38:26,665 [shard 0] gossip - Node 127.0.55.6 is down for get_endpoint_states verb
WARN  2023-07-03 07:38:26,665 [shard 0] gossip - Node 127.0.55.5 is down for get_endpoint_states verb
INFO  2023-07-03 07:38:26,666 [shard 0] gossip - Gossip shadow round finished with nodes_talked={127.0.55.2, 127.0.55.1, 127.0.55.3, 127.0.55.4}
INFO  2023-07-03 07:38:26,666 [shard 0] gossip - Feature check passed. Local node 127.0.55.8 features = {AGGREGATE_STORAGE_OPTIONS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COLLECTION_INDEXING, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, EMPTY_REPLICA_PAGES, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_COLLECTION_DETECTION, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_COMMITLOG, SCHEMA_TABLES_V3, SECONDARY_INDEXES_ON_STATIC_COLUMNS, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, TYPED_ERRORS_IN_READ_RPC, UDA, UDA_NATIVE_PARALLELIZED_AGGREGATION, UNBOUNDED_RANGE_TOMBSTONES, UUID_SSTABLE_IDENTIFIERS, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {AGGREGATE_STORAGE_OPTIONS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COLLECTION_INDEXING, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, EMPTY_REPLICA_PAGES, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_COLLECTION_DETECTION, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_COMMITLOG, SCHEMA_TABLES_V3, SECONDARY_INDEXES_ON_STATIC_COLUMNS, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, TYPED_ERRORS_IN_READ_RPC, UDA, UDA_NATIVE_PARALLELIZED_AGGREGATION, UNBOUNDED_RANGE_TOMBSTONES, UUID_SSTABLE_IDENTIFIERS, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}
INFO  2023-07-03 07:38:26,666 [shard 0] storage_service - Host 189dd9bb-d950-446c-b527-a544925bc828/127.0.55.8 is replacing d97b63b2-72f5-4640-bba8-893420adf609/127.0.55.5
INFO  2023-07-03 07:38:26,666 [shard 0] storage_service - Replacing a node with a different IP address, my address=127.0.55.8, node being replaced=127.0.55.5
INFO  2023-07-03 07:38:26,666 [shard 0] storage_service - Save advertised features list in the 'system.local' table
INFO  2023-07-03 07:38:26,669 [shard 0] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-07-03 07:38:26,669 [shard 0] storage_service - Starting up server gossip
INFO  2023-07-03 07:38:26,676 [shard 0] compaction - [Compact system.local 98dfab40-1974-11ee-8f2d-ec009dbc2650] Compacting [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-2-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-4-big-Data.db:level=0:origin=memtable]
INFO  2023-07-03 07:38:26,677 [shard 0] gossip - failure_detector_loop: Started main loop
INFO  2023-07-03 07:38:26,677 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-07-03 07:38:26,686 [shard 0] compaction - [Compact system.local 98dfab40-1974-11ee-8f2d-ec009dbc2650] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-nvlixxhx/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-6-big-Data.db:level=0]. 13kB to 7kB (~55% of original) in 6ms = 1MB/s. ~256 total partitions merged to 1.
INFO  2023-07-03 07:38:27,988 [shard 0] storage_service - Set host_id=9137e227-94f4-449e-a6c6-dceed01cff99 to be owned by node=127.0.55.4
WARN  2023-07-03 07:38:27,990 [shard 0] gossip - Fail to send EchoMessage to 127.0.55.5: seastar::rpc::closed_error (connection is closed)
WARN  2023-07-03 07:38:27,990 [shard 0] gossip - Fail to send EchoMessage to 127.0.55.6: seastar::rpc::closed_error (connection is closed)
WARN  2023-07-03 07:38:27,990 [shard 0] gossip - Fail to send EchoMessage to 127.0.55.7: seastar::rpc::closed_error (connection is closed)
INFO  2023-07-03 07:38:27,990 [shard 0] gossip - InetAddress 127.0.55.4 is now UP, status = NORMAL
INFO  2023-07-03 07:38:27,990 [shard 0] gossip - InetAddress 127.0.55.3 is now UP, status = NORMAL
INFO  2023-07-03 07:38:27,990 [shard 0] gossip - InetAddress 127.0.55.1 is now UP, status = NORMAL
INFO  2023-07-03 07:38:27,990 [shard 0] gossip - InetAddress 127.0.55.2 is now UP, status = NORMAL
INFO  2023-07-03 07:38:27,991 [shard 0] storage_service - Set host_id=c6089c7e-809f-447e-a2c8-91009de372e2 to be owned by node=127.0.55.3
INFO  2023-07-03 07:38:27,993 [shard 0] storage_service - Set host_id=66e3c3ad-d16d-469c-98e2-6c3ff5f8aaf8 to be owned by node=127.0.55.1
INFO  2023-07-03 07:38:27,994 [shard 0] gossip - Live nodes seen in gossip: {127.0.55.1, 127.0.55.2, 127.0.55.3, 127.0.55.4, 127.0.55.8}

==> [BH] it looks like the exception happened here

INFO  2023-07-03 07:38:27,995 [shard 0] init - Shutting down group 0 service
...
INFO  2023-07-03 07:38:27,997 [shard 0] storage_service - Set host_id=ec4c0157-1523-4a1d-8aeb-e56e685e507a to be owned by node=127.0.55.6

==> [BH] handle_state_normal happens in parallel to storage_service::join_token_ring

...
INFO  2023-07-03 07:38:28,026 [shard 0] gossip - InetAddress 127.0.55.6 is now DOWN, status = shutdown
...
ERROR 2023-07-03 07:38:36,614 [shard 0] init - Startup failed: std::runtime_error (Failed to parse node list: {127.0.55.6, 127.0.55.7}: invalid node=127.0.55.6: std::runtime_error (Host inet address 127.0.55.6 not found in the cluster))

This looks like another fallout from 50e8ec7.

I think that in

co_await _gossiper.wait_for_live_nodes_to_show_up(2);
auto ignore_nodes = ri
    ? parse_node_list(_db.local().get_config().ignore_dead_nodes_for_replace(), get_token_metadata())
    // TODO: specify ignore_nodes for bootstrap
    : std::unordered_set<gms::inet_address>{};

wait_for_live_nodes_to_show_up(2) doesn't ensure that the nodes in the ignore list (which are given by IP address in this test) have their host_id set up by

slogger.info("Set host_id={} to be owned by node={}", host_id, endpoint);
tmptr->update_host_id(host_id, endpoint);
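A minimal sketch of why that ordering matters (editor's illustration; `parse_node_list_model` and its plain-string types are assumptions, not the actual ScyllaDB implementation): translating an `--ignore-dead-nodes-for-replace` entry given as an IP requires the IP-to-host-ID mapping that only `handle_state_normal` / `update_host_id` populates, so doing the translation before those handlers have run fails with exactly the error from the log.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Editor's model of the IP -> host ID translation done for entries of
// --ignore-dead-nodes-for-replace; names and types are simplified assumptions.
std::set<std::string> parse_node_list_model(
        const std::set<std::string>& nodes,                        // e.g. {"127.0.55.6", "127.0.55.7"}
        const std::map<std::string, std::string>& ip_to_host_id) { // populated by handle_state_normal
    std::set<std::string> host_ids;
    for (const auto& ip : nodes) {
        auto it = ip_to_host_id.find(ip);
        if (it == ip_to_host_id.end()) {
            // Corresponds to the observed "Host inet address ... not found in the cluster".
            throw std::runtime_error("Host inet address " + ip + " not found in the cluster");
        }
        host_ids.insert(it->second);
    }
    return host_ids;
}
```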

bhalevy added the status/regression, P1 Urgent, symptom/ci stability, and triage/master (Looking for assignee) labels on Jul 3, 2023
DoronArazii commented:

@kbr-scylla please have a look

DoronArazii added this to the 5.3 milestone on Jul 3, 2023
bhalevy (Member, Author) commented Jul 3, 2023

@DoronArazii this issue doesn't exist in 5.3 since 50e8ec7 is confined to master (5.4.dev)

bhalevy modified the milestones: 5.3 → 5.4 on Jul 3, 2023
kbr-scylla (Contributor) commented:

> wait_for_live_nodes_to_show_up(2) doesn't ensure that the nodes in the ignore list (which are given by IP address in this test) have their host_id set up

How was that ensured before in this test? By ring_delay sleep?

kbr-scylla (Contributor) commented:

We've got a bit of a chicken-and-egg problem here:

  • we need to wait for NORMAL state handlers for nodes from set X
  • to calculate set X, we need to calculate ignore_nodes
  • but to calculate ignore_nodes, we need to wait until these nodes' NORMAL state handlers have finished (at least partially - up to the step that sets host_ids)...

Damn it, why did I think that trying to improve gossiper boot code is a good idea.
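To make the cycle concrete, here is a tiny standalone illustration (editor's sketch, not ScyllaDB code): if the ignore list is translated while the IP-to-host-ID view is still empty, i.e. before the relevant NORMAL state handlers have run, the lookup necessarily fails.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

int main() {
    // token_metadata's IP -> host ID view before any handle_state_normal ran: empty.
    std::map<std::string, std::string> ip_to_host_id;
    // The dead nodes passed via --ignore-dead-nodes-for-replace in this test.
    std::set<std::string> ignore_list = {"127.0.55.6", "127.0.55.7"};

    bool translation_possible = true;
    for (const auto& ip : ignore_list) {
        if (ip_to_host_id.find(ip) == ip_to_host_id.end()) {
            translation_possible = false; // the real code throws std::runtime_error here
        }
    }
    assert(!translation_possible); // hence the chicken-and-egg: handlers must run first
    return 0;
}
```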

kbr-scylla (Contributor) commented:

    , ignore_dead_nodes_for_replace(this, "ignore_dead_nodes_for_replace", value_status::Used, "", "List dead nodes to ignore for replace operation using a comma-separated list of host IDs. E.g., scylla --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e")

it's interesting that the implementation actually supports passing IPs

bhalevy (Member, Author) commented Jul 3, 2023

    , ignore_dead_nodes_for_replace(this, "ignore_dead_nodes_for_replace", value_status::Used, "", "List dead nodes to ignore for replace operation using a comma-separated list of host IDs. E.g., scylla --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e")

> it's interesting that the implementation actually supports passing IPs

That's for backward compatibility.
We can decide to deprecate that and require passing host_ids.

kbr-scylla (Contributor) commented:

However, deprecation requires a deprecation period, so I still need to solve the problem.

I do have an idea, but I need to verify it.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
storage_service: don't calculate `ignore_nodes` before waiting for normal handlers

Before this commit the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec7).

Unfortunately, as always with gossiper, there are complications.
In scylladb#14468 and scylladb#14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

The `parse_node_list` problem is solved in this commit. It turns out
that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`)
in order to wait for NORMAL state handlers. We can wait for handlers to
finish for *any* `NORMAL`/`shutdown` entries appearing in gossiper, even
those that correspond to dead/ignored nodes and obsolete IPs.
`handle_state_normal` is called, and eventually finishes, for all of
them. `wait_for_normal_state_handled_on_boot` no longer receives a set
of nodes as parameter and is modified appropriately, it's now
calculating the necessary set of nodes on each retry (the set may shrink
while we're waiting, e.g. because an entry corresponding to a node that
was replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in scylladb#14487, but the test
will still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set. We fix this
in the following commit which will solve both issues.
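A rough sketch of the waiting scheme described above (editor's model under stated assumptions; the real `wait_for_normal_state_handled_on_boot` is a Seastar coroutine, not a thread-sleeping loop): the set of still-unhandled `NORMAL`/`shutdown` entries is recomputed on every retry instead of being fixed up front, so it can shrink while we wait.

```cpp
#include <chrono>
#include <functional>
#include <set>
#include <stdexcept>
#include <string>
#include <thread>

// Editor's model: wait until every gossiper entry in NORMAL/shutdown status has had
// its handle_state_normal completed, recomputing the pending set on each retry.
void wait_for_normal_state_handled_on_boot_model(
        std::function<std::set<std::string>()> unhandled_normal_entries, // recomputed every iteration
        std::chrono::seconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const auto pending = unhandled_normal_entries();
        if (pending.empty()) {
            return; // all NORMAL/shutdown entries handled; safe to compute ignore_nodes now
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            throw std::runtime_error("timed out waiting for NORMAL state handlers");
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}
```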
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
storage_service: use `token_metadata` to calculate nodes waited for to be UP

At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.

In scylladb#14468 and scylladb#14487 we observed that this set may include entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

Fixes scylladb#14468
Fixes scylladb#14487
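A small sketch of the `sync_nodes` calculation this commit describes (editor's model; `compute_sync_nodes_model` and the string-keyed sets are assumptions): take the current NORMAL members as seen by `token_metadata` (collisions already resolved, so no replaced nodes or obsolete IPs) and drop the nodes listed in `--ignore-dead-nodes-for-replace`; the result is what gets passed to `_gossiper.wait_alive`.

```cpp
#include <set>
#include <string>

// Editor's model: derive the nodes to wait for (UP) from token_metadata's view of
// the current NORMAL members, minus the nodes we were told to ignore for replace.
std::set<std::string> compute_sync_nodes_model(
        const std::set<std::string>& token_metadata_normal_members,
        const std::set<std::string>& ignore_nodes) {
    std::set<std::string> sync_nodes;
    for (const auto& node : token_metadata_normal_members) {
        if (ignore_nodes.find(node) == ignore_nodes.end()) {
            sync_nodes.insert(node); // these go to _gossiper.wait_alive(...)
        }
    }
    return sync_nodes;
}
```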
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
Regression test for scylladb#14487 on steroids. It performs 3 consecutive node
replace operations, starting with 3 dead nodes.

In order to have a Raft majority, we have to boot a 7-node cluster, so
we enable this test only in one mode; the choice was between `dev` and
`release`, I picked `dev` because it compiles faster and I develop on
it.
DoronArazii removed the triage/master (Looking for assignee) label on Jul 9, 2023
tgrabiec added a commit that referenced this issue Jul 9, 2023
…es, recently replaced nodes, and recently changed IPs' from Kamil Braun

Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec7).

Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs.  `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as parameter and is modified appropriately, it's now calculating
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

We also introduce regression tests with necessary extensions
to the test framework.

Fixes #14468
Fixes #14487

Closes #14507

* github.com:scylladb/scylladb:
  test: rename `test_topology_ip.py` to `test_replace.py`
  test: test bootstrap after IP change
  test: scylla_cluster: return the new IP from `change_ip` API
  test: node replace with `ignore_dead_nodes` test
  test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
  storage_service: remove `get_nodes_to_sync_with`
  storage_service: use `token_metadata` to calculate nodes waited for to be UP
  storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
avikivity (Member) commented:

@bhalevy guessing this was a transient regression and doesn't need a backport

bhalevy (Member, Author) commented Dec 3, 2023

> @bhalevy guessing this was a transient regression and doesn't need a backport

Correct. Both the patch that caused the regression and the fix are in 5.4 now.
