test_decommission_after_changing_node_ip: node4 fails to start when stale ip address of node3 appears as DOWN, status=shutdown #14468

Closed
bhalevy opened this issue Jul 1, 2023 · 5 comments · Fixed by #14507

bhalevy commented Jul 1, 2023

Seen first in https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release/290/testReport/update_cluster_layout_tests/TestUpdateClusterLayout/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_decommission_after_changing_node_ip/

RuntimeError: The process is dead, returncode=1

https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release/290/artifact/logs-full.release.010/1688093180144_update_cluster_layout_tests.py%3A%3ATestUpdateClusterLayout%3A%3Atest_decommission_after_changing_node_ip/node4.log

Scylla version 5.4.0~dev-0.20230630.ff386e7a445d with build-id 228b9c341f6b5404947f4a2020d4a25eeaacebbe starting ...

INFO  2023-06-30 02:45:44,589 [shard 0] init - seeds={127.0.96.1, 127.0.96.2, 127.0.96.4, 127.0.96.33}, listen_address=127.0.96.4, broadcast_address=127.0.96.4

INFO  2023-06-30 02:45:44,675 [shard 0] storage_service - Starting up server gossip
INFO  2023-06-30 02:45:44,683 [shard 0] compaction - [Compact system.local 35def3b0-16f0-11ee-a3e4-02af61fe3371] Compacting [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-0iu1ozg8/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/mc-2-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-0iu1ozg8/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/mc-4-big-Data.db:level=0:origin=memtable]
INFO  2023-06-30 02:45:44,683 [shard 0] gossip - failure_detector_loop: Started main loop
INFO  2023-06-30 02:45:44,683 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-06-30 02:45:44,693 [shard 0] compaction - [Compact system.local 35def3b0-16f0-11ee-a3e4-02af61fe3371] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-0iu1ozg8/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/mc-6-big-Data.db:level=0]. 13kB to 7kB (~55% of original) in 7ms = 1MB/s. ~256 total partitions merged to 1.
INFO  2023-06-30 02:45:46,131 [shard 0] storage_service - Set host_id=97569d56-628c-4485-b0e6-1c68c40c537f to be owned by node=127.0.96.1
INFO  2023-06-30 02:45:46,132 [shard 0] gossip - InetAddress 127.0.96.33 is now DOWN, status = LEFT
WARN  2023-06-30 02:45:46,132 [shard 0] gossip - Fail to send EchoMessage to 127.0.96.3: seastar::rpc::closed_error (connection is closed)
INFO  2023-06-30 02:45:46,132 [shard 0] gossip - InetAddress 127.0.96.1 is now UP, status = NORMAL
INFO  2023-06-30 02:45:46,132 [shard 0] gossip - InetAddress 127.0.96.2 is now UP, status = NORMAL
INFO  2023-06-30 02:45:46,133 [shard 0] gossip - Node 127.0.96.33 will be removed from gossip at [2023-07-03 02:45:19]: (expire = 1688352319445716595, now = 1688093146132961662, diff = 259173 seconds)
INFO  2023-06-30 02:45:46,133 [shard 0] storage_service - Removing tokens {1957138825282075456, 538228788046155918, 1441663362887118471, 2009889732205232147, 1285921624202456720, 2299813546280369809, 1147922929401656711, 3437689283141886193, 1093939752358477342, -95578895009921325, -933537828167744695, -4822175528068997885, -9064814390348450199, -2838398123611190415, -8865697753194005691, -8641240172602522619, -8527192015627880935, -8455721658642346121, -1602885157652809629, -826999321362664251, -8004313790603409879, -7983704426324750105, 6745400184375170269, 1897403336345133737, -7785637751534354940, 6402350960772843634, -9131862161278302934, 7015699157901876878, -7507635517787229617, -7872263450865226137, -2398561151223758453, -2757858320304448776, -7135017416280759518, 762591318120088757, -7036247732884944684, -4706718406784561136, 772993406584974221, -6901710291266973174, -6873621148550053390, -131764165975166161, -6467486352483923794, -8723719628383537842, -7239257021703251755, -8876209735699572524, -6300796970024261941, -633462019235493849, -6310403927118869660, -1431497748581863464, 3697526867126198196, 4923690827301917877, 8112265936827023660, -6277863917912802240, -6623626227015086286, -6263567382473675505, 2044340043996898806, -1684759488608644447, 4219519608993832438, 622262989057717658, -6132029616733649227, -2863638381489490074, -6093224496148114428, -7168585208466197697, 82753073108884114, -5933252256733306442, -5452048513549809665, 3547905835448042745, -5895694438770948515, 7045113366380893923, -6640938570355561189, -5660282275830501698, -3728555925078188231, -4034527578391978588, -5233014222012667101, -8844096475636041617, -2839352313897520511, -917431164596666618, 3618597186644201391, 1638066742741535716, -5023041224884994968, -2616649082261222559, -2415888436872385330, -2940916223950425714, -8994438244973781987, -833074952334662081, -4426602942396513532, 649128337838919561, 7775225799600157744, 1580061385397170636, -2402418775007732133, -7427857065668640033, -1049799291451885300, -2550242859911733562, 5198876569107239321, -8301793316585325178, 5554285130267436886, -8058007130247916491, -4088885882961805256, -7332104044879353661, -1732367245764070833, -5214440851373603397, -4028381155067564419, 741098824345071851, -2076353771709165172, 2095974798297043788, 3679342026564125770, -637331905247054159, -42107836694803384, 6736375668058457916, -231760654102333982, 4133801035329363389, -129951755050233280, -1712843812334547699, -2812010428323279743, -8972822811092831737, -5934804369128314387, -1911244404117897093, 7399718564801560588, -426914257413052499, 1951602158941318813, -2310239841370092592, -5096146377327979133, -5353838904714672202, 7416808377702594272, -2762853363408437030, -1987318027109743947, -2698847702965464001, -2389636933915703465, -1642036105268827801, 729012831382377832, 4192467494245733578, 453328910274435383, 4604210925835301206, -4861600011407716479, -8866456080701519922, 4710686227407048955, 4060438571529584411, 8082128952710491799, -8917743902008465042, 4830388162512167990, 5405228356480841250, 3896618241085296806, 7483054753003347861, 4848712762403721656, 8155890873069566919, 6050126936624620059, 8143687424717096518, 5099342403467628764, 8193264499918054230, 8364320176004023673, 7464679094141006861, 2369583153774542736, 4661971083725460006, 6557277420237977594, 8907558311654534023, 4255593540650591036, 4697507652786315355, 8795941309877878551, 5544435872844200824, 994944297945037712, 8674029532381474478, 8938215158385744024, 9175241447091780067, 
8319752497800048455, 9098428809063992811, 7083019289871699050, 6392147032912275551, 959369937171193663, 775223285155485749, 6969946619945396669, -239385818321278898, 7867023678609882900, 8485756174775015178, 653524404411507543, 2712519592800561458, 784408449538585673, 7516305215489330783, 5284849260795346378, 4542935427815921456, 7491143762287671218, 5179321468636036249, -2958423916817125021, 8621454108457851488, 5343972147310513540, 7050409196988331169, 6241621443978259740, 5691844199512967388, 743728028477351888, 8410411973934585175, 6021480115957511646, -3856883873430924609, -7280028224051135209, 8185743236066043166, 4819273962897017308, 5059145346103373931, 5661813372786345482, 5599681698548794203, 6932747168268650376, 6400515060487972645, 785000777408566651, 8595379441425988714, 4145778381949139232, 3999909828137760869, 6405837028323423354, 3817823138014204958, 5053716988947964343, 2101315427911447873, -4806749523356533582, -132972332550087561, -5353424467192232738, -8802576457060150269, -3474097763156463305, 9009209169705780290, 7802793542092058234, 2662779670492735497, 3691097620139590768, 6513610463854070484, 6471493476659702419, 3650193360811123081, 3001254928584914018, 3499840302063501290, 6779972895102656116, 3131683082988308425, 3031653570599157564, 4892084020860236578, 2497511531750475769, 285759181479020292, 247802566240022100, 2284461744151074821, 8207089875905276412, -3411577389189401767, 2411677680261672874, 7324966747666123341, 2377271546849325331, 2319710447218706879, 2194167096131029618, 7258249247273455641, -5125548559804459378, -4212817902181982644, -4997823109883866192, -9119142511799409952, -3973719466666207470, -2050716514012087603, 5878268823371831765, -5038661372228412283, -6264926153255382601, 5237492490607863914, 9162069553344955750, -3338928958552776636, -4485975510199365998, -3289713503100616853, 3340398587603137395, 3269564000274247296, -3109033788701436529, 8147309304491324864, -4879276416205021178, -5513812679592304083} for 127.0.96.33
INFO  2023-06-30 02:45:46,133 [shard 0] gossip - removed 127.0.96.33 from _seeds, updated _seeds list = {127.0.96.1, 127.0.96.2}
INFO  2023-06-30 02:45:46,134 [shard 0] storage_service - Set host_id=bd2c8d8d-3751-4491-9b96-ac414b747fa7 to be owned by node=127.0.96.2
INFO  2023-06-30 02:45:46,136 [shard 0] storage_service - Skip to set host_id=ed379b1a-c563-44be-acf0-28ccd6ca7096 to be owned by node=127.0.96.3, because the node is removed from the cluster, nodes {127.0.96.3, 127.0.96.33} used to own the host_id
INFO  2023-06-30 02:45:46,137 [shard 0] gossip - InetAddress 127.0.96.3 is now DOWN, status = shutdown
INFO  2023-06-30 02:45:46,139 [shard 0] migration_manager - Requesting schema pull from 127.0.96.1:0
INFO  2023-06-30 02:45:46,139 [shard 0] migration_manager - Pulling schema from 127.0.96.1:0
INFO  2023-06-30 02:45:46,140 [shard 0] migration_manager - Requesting schema pull from 127.0.96.2:0
INFO  2023-06-30 02:45:46,140 [shard 0] migration_manager - Pulling schema from 127.0.96.2:0
INFO  2023-06-30 02:45:46,142 [shard 0] gossip - Live nodes seen in gossip: {127.0.96.1, 127.0.96.2, 127.0.96.4}
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.3, status=shutdown
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.33, status=LEFT
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.2, status=NORMAL
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.1, status=NORMAL
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.4, status=UNKNOWN
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Started waiting for normal state handler for nodes {127.0.96.1, 127.0.96.2, 127.0.96.3}

INFO  2023-06-30 02:46:16,245 [shard 0] init - Shutting down group 0 service
...
ERROR 2023-06-30 02:46:16,771 [shard 0] init - Startup failed: std::runtime_error (Failed to mark node as alive in 30000 ms, nodes={127.0.96.1, 127.0.96.2, 127.0.96.3}, live_nodes={127.0.96.1, 127.0.96.2})
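For context, the timeout above comes from a bounded liveness wait performed right after gossip starts. Below is a minimal standalone sketch of that failure mode in plain C++, using hypothetical simplified types rather than Scylla's actual gossiper API: if the wait set contains an address that will never be reported live (here node3's stale 127.0.96.3), the deadline expires and startup aborts.

```cpp
// Standalone illustration, not Scylla code: the types, poll interval, and the
// shortened 500 ms budget are assumptions; the real run used a 30000 ms budget.
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <thread>

using inet_address = std::string;

// Wait until every address in `nodes` appears in `get_live()`, or give up at `timeout`.
bool wait_alive(const std::set<inet_address>& nodes,
                const std::function<std::set<inet_address>()>& get_live,
                std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        const auto live = get_live();
        if (std::includes(live.begin(), live.end(), nodes.begin(), nodes.end())) {
            return true;  // every waited-for node is live
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    return false;
}

int main() {
    // node4's wait set wrongly includes node3's stale address 127.0.96.3,
    // which can never become live, so the wait is doomed to time out.
    const std::set<inet_address> nodes = {"127.0.96.1", "127.0.96.2", "127.0.96.3"};
    const auto get_live = [] {
        return std::set<inet_address>{"127.0.96.1", "127.0.96.2"};
    };
    if (!wait_alive(nodes, get_live, std::chrono::milliseconds(500))) {
        std::cout << "failed to mark all nodes alive; live={127.0.96.1, 127.0.96.2}\n";
    }
}
```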
@bhalevy bhalevy changed the title test_decommission_after_changing_node_ip: node4 fails to start when stale ip address of node3 appears as DOWN test_decommission_after_changing_node_ip: node4 fails to start when stale ip address of node3 appears as DOWN, status=shutdown Jul 1, 2023
bhalevy commented Jul 1, 2023

@kbr-scylla, I bisected this to this merge: 50e8ec7

@bhalevy bhalevy added the status/regression, area/repair-based operations (repair-based node operations), and triage/master (Looking for assignee) labels Jul 2, 2023
bhalevy commented Jul 2, 2023

The test passes with the following patch:

diff --git a/service/storage_service.cc b/service/storage_service.cc
index 1771800565..9e9fecc436 100644
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -2459,11 +2459,14 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
     if (tmptr->is_normal_token_owner(endpoint)) {
         slogger.info("Node {} state jump to normal", endpoint);
     }
-    std::unordered_set<inet_address> endpoints_to_remove;
+    std::unordered_map<inet_address, bool> endpoints_to_remove;
 
-    auto do_remove_node = [&] (gms::inet_address node) {
+    auto do_remove_node = [&] (gms::inet_address node, bool force = false) {
         tmptr->remove_endpoint(node);
-        endpoints_to_remove.insert(node);
+        auto [it, inserted] = endpoints_to_remove.try_emplace(node, force);
+        if (!inserted && force) {
+            it->second = force;
+        }
     };
     // Order Matters, TM.updateHostID() should be called before TM.updateNormalToken(), (see CASSANDRA-4300).
     if (_gossiper.uses_host_id(endpoint)) {
@@ -2475,7 +2478,7 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
                 do_remove_node(endpoint);
             } else if (_gossiper.compare_endpoint_startup(endpoint, *existing) > 0) {
                 slogger.warn("Host ID collision for {} between {} and {}; {} is the new owner", host_id, *existing, endpoint, endpoint);
-                do_remove_node(*existing);
+                do_remove_node(*existing, true);
                 slogger.info("Set host_id={} to be owned by node={}, existing={}", host_id, endpoint, *existing);
                 tmptr->update_host_id(host_id, endpoint);
             } else {
@@ -2557,7 +2560,7 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
 
     for (const auto& ep : candidates_for_removal) {
         slogger.info("handle_state_normal: endpoints_to_remove endpoint={}", ep);
-        endpoints_to_remove.insert(ep);
+        endpoints_to_remove.try_emplace(ep, false);
     }
 
     bool is_normal_token_owner = tmptr->is_normal_token_owner(endpoint);
@@ -2584,8 +2587,8 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
     co_await replicate_to_all_cores(std::move(tmptr));
     tmlock.reset();
 
-    for (auto ep : endpoints_to_remove) {
-        co_await remove_endpoint(ep);
+    for (const auto& [ep, force] : endpoints_to_remove) {
+        co_await remove_endpoint(ep, force);
     }
     slogger.debug("handle_state_normal: endpoint={} is_normal_token_owner={} endpoint_to_remove={} owned_tokens={}", endpoint, is_normal_token_owner, endpoints_to_remove.contains(endpoint), owned_tokens);
     if (!owned_tokens.empty() && !endpoints_to_remove.count(endpoint)) {
@@ -3208,8 +3211,12 @@ future<> storage_service::check_for_endpoint_collision(std::unordered_set<gms::i
     });
 }
 
-future<> storage_service::remove_endpoint(inet_address endpoint) {
-    co_await _gossiper.remove_endpoint(endpoint);
+future<> storage_service::remove_endpoint(inet_address endpoint, bool force) {
+    if (force) {
+        co_await _gossiper.force_remove_endpoint(endpoint);
+    } else {
+        co_await _gossiper.remove_endpoint(endpoint);
+    }
     try {
         co_await _sys_ks.local().remove_endpoint(endpoint);
     } catch (...) {
diff --git a/service/storage_service.hh b/service/storage_service.hh
index c0ba08ac67..abf49cb104 100644
--- a/service/storage_service.hh
+++ b/service/storage_service.hh
@@ -547,7 +547,7 @@ class storage_service : public service::migration_listener, public gms::i_endpoi
     future<> excise(std::unordered_set<token> tokens, inet_address endpoint, long expire_time);
 
     /** unlike excise we just need this endpoint gone without going through any notifications **/
-    future<> remove_endpoint(inet_address endpoint);
+    future<> remove_endpoint(inet_address endpoint, bool force = false);
 
     void add_expire_time_if_found(inet_address endpoint, int64_t expire_time);
 

bhalevy commented Jul 2, 2023

I'm testing this patch with the following dtest suites: update_cluster_layout_tests.py, replace_address_test.py, repair_based_node_operations_test.py.
Will submit if it passes cleanly.

bhalevy added a commit to bhalevy/scylla that referenced this issue Jul 2, 2023
When a host changes its IP address we should force remove
the previous endpoint state since we want only one endpoint
to refer to this host_id.

If the new node that changed its IP address is decommissioned,
the previous node appears as a normal token owner, just in shutdown
status, but it is no longer in the cluster.

Fixes scylladb#14468

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
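To illustrate the intent of the patch, here is a minimal standalone model (not Scylla's actual `handle_state_normal`/gossiper implementation; all types and names below are simplified assumptions): when an endpoint gossips NORMAL or shutdown for a host_id that is already mapped to a different IP, the stale IP must be dropped outright so only one endpoint ever refers to that host_id.

```cpp
// Hypothetical simplified model of the "force remove the previous endpoint
// state" idea; it is not the actual storage_service code.
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

using inet_address = std::string;
using host_id = std::string;

struct cluster_view {
    std::unordered_map<host_id, inet_address> owner_by_host_id;  // host_id -> current IP
    std::set<inet_address> normal_endpoints;                     // endpoints treated as token owners

    // Called when `endpoint` announces state NORMAL (or shutdown) claiming `id`.
    void handle_state_normal(const inet_address& endpoint, const host_id& id) {
        auto it = owner_by_host_id.find(id);
        if (it != owner_by_host_id.end() && it->second != endpoint) {
            // Host ID collision: the same host_id used to be owned by another IP.
            // The newer endpoint wins and the old entry is force-removed, so it
            // cannot linger as a NORMAL/shutdown token owner.
            std::cout << "host_id " << id << " moved " << it->second << " -> "
                      << endpoint << ", force-removing stale endpoint\n";
            normal_endpoints.erase(it->second);
        }
        owner_by_host_id[id] = endpoint;
        normal_endpoints.insert(endpoint);
    }
};

int main() {
    cluster_view view;
    view.handle_state_normal("127.0.96.3", "ed379b1a");   // node3 at its original IP
    view.handle_state_normal("127.0.96.33", "ed379b1a");  // node3 after changing its IP
    // Only the new IP remains; the stale 127.0.96.3 no longer counts as an owner.
    for (const auto& ep : view.normal_endpoints) {
        std::cout << "normal endpoint: " << ep << '\n';
    }
}
```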
@bhalevy bhalevy added this to the 5.4 milestone Jul 3, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
storage_service: don't calculate `ignore_nodes` before waiting for normal handlers

Before this commit the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec7).

Unfortunately, as always with gossiper, there are complications.
In scylladb#14468 and scylladb#14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

The `parse_node_list` problem is solved in this commit. It turns out
that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`)
in order to wait for NORMAL state handlers. We can wait for handlers to
finish for *any* `NORMAL`/`shutdown` entries appearing in gossiper, even
those that correspond to dead/ignored nodes and obsolete IPs.
`handle_state_normal` is called, and eventually finishes, for all of
them. `wait_for_normal_state_handled_on_boot` no longer receives a set
of nodes as a parameter and is modified appropriately; it now
calculates the necessary set of nodes on each retry (the set may shrink
while we're waiting, e.g. because an entry corresponding to a node that
was replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in scylladb#14487, but the test
will still fail. That's because `_gossiper.wait_alive` may still receive
too large a set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set. We fix this
in the following commit which will solve both issues.
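A minimal standalone sketch of the retry-loop idea described above, assuming hypothetical simplified types and an arbitrary poll interval (it is not the actual coroutine-based `wait_for_normal_state_handled_on_boot`): the pending set is recomputed on every iteration, so an obsolete entry that gets garbage-collected from gossiper state simply drops out of the wait.

```cpp
// Standalone model only; all names and timings here are assumptions.
#include <chrono>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>

using inet_address = std::string;

struct gossip_snapshot {
    std::set<inet_address> normal_or_shutdown;  // entries currently present in gossiper state
    std::set<inet_address> handled;             // entries whose NORMAL handler has finished
};

// Recompute the pending set on each retry instead of waiting on a fixed node list.
template <typename Poll>
bool wait_for_normal_state_handled(Poll poll, std::chrono::seconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        const gossip_snapshot snap = poll();
        std::set<inet_address> pending;
        for (const auto& ep : snap.normal_or_shutdown) {
            if (!snap.handled.count(ep)) {
                pending.insert(ep);
            }
        }
        if (pending.empty()) {
            return true;  // every entry still present has been handled
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            std::cerr << "timed out with " << pending.size() << " entries pending\n";
            return false;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

int main() {
    std::mutex m;
    gossip_snapshot state{{"127.0.96.1", "127.0.96.2", "127.0.96.3"}, {"127.0.96.1"}};
    // Simulate handlers finishing and a stale entry being garbage-collected.
    std::thread progress([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(300));
        std::lock_guard<std::mutex> guard(m);
        state.handled.insert("127.0.96.2");
        state.normal_or_shutdown.erase("127.0.96.3");  // obsolete IP disappears
    });
    const bool ok = wait_for_normal_state_handled([&] {
        std::lock_guard<std::mutex> guard(m);
        return state;  // copy taken under the lock
    }, std::chrono::seconds(5));
    progress.join();
    std::cout << (ok ? "all NORMAL/shutdown entries handled\n" : "timed out\n");
}
```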
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
storage_service: use `token_metadata` to calculate nodes waited for to be UP

At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.

In scylladb#14468 and scylladb#14487 we observed that this set may include entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

Fixes scylladb#14468
Fixes scylladb#14487
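A standalone sketch of the approach described above, with simplified types standing in for `token_metadata` (this is not the real helper from ca61d88; all names here are assumptions): because the map is keyed by Host ID and IP collisions were resolved when it was populated, neither the stale IP nor the decommissioned node can end up in the wait set.

```cpp
// Hypothetical simplified model; not Scylla's API.
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

using inet_address = std::string;
using host_id = std::string;

// Derive the nodes to wait for from a Host-ID-keyed owner map (one entry per
// current cluster member), excluding ourselves and any explicitly ignored nodes.
std::set<inet_address> nodes_to_wait_for(
        const std::unordered_map<host_id, inet_address>& normal_token_owners,
        const std::set<inet_address>& ignore_nodes,
        const inet_address& my_address) {
    std::set<inet_address> result;
    for (const auto& [id, ep] : normal_token_owners) {
        if (ep != my_address && !ignore_nodes.count(ep)) {
            result.insert(ep);
        }
    }
    return result;
}

int main() {
    // Only the real current members appear: 127.0.96.3 was superseded by the IP
    // change and 127.0.96.33 left the ring, so neither is listed here.
    const std::unordered_map<host_id, inet_address> normal_token_owners = {
        {"97569d56", "127.0.96.1"},
        {"bd2c8d8d", "127.0.96.2"},
    };
    for (const auto& ep : nodes_to_wait_for(normal_token_owners, /*ignore_nodes=*/{}, "127.0.96.4")) {
        std::cout << "wait for " << ep << " to be UP\n";
    }
}
```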
@DoronArazii DoronArazii removed the triage/master (Looking for assignee) label Jul 9, 2023
tgrabiec added a commit that referenced this issue Jul 9, 2023
…es, recently replaced nodes, and recently changed IPs' from Kamil Braun

Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec7).

Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs.  `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as a parameter and is modified appropriately; it now calculates
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
too large a set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

We also introduce regression tests with necessary extensions
to the test framework.

Fixes #14468
Fixes #14487

Closes #14507

* github.com:scylladb/scylladb:
  test: rename `test_topology_ip.py` to `test_replace.py`
  test: test bootstrap after IP change
  test: scylla_cluster: return the new IP from `change_ip` API
  test: node replace with `ignore_dead_nodes` test
  test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
  storage_service: remove `get_nodes_to_sync_with`
  storage_service: use `token_metadata` to calculate nodes waited for to be UP
  storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
bhalevy added a commit to bhalevy/scylla that referenced this issue Aug 6, 2023
When a host changes its IP address we should force remove
the previous endpoint state since we want only one endpoint
to refer to this host_id.

If the new node that changed its IP address is decommissioned,
the previous node appears as a normal token owner, just in shutdown
status, but it is no longer in the cluster.

Refs scylladb#14468
Fixes scylladb#13775

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
denesb commented Dec 15, 2023

@bhalevy @kbr-scylla do we need to backport this to 5.2? If yes, please open a backport PR.

kbr-scylla commented

This issue was caused by 50e8ec7, which did not end up in 5.2, so no.
