test_decommission_after_changing_node_ip: node4 fails to start when stale ip address of node3 appears as DOWN, status=shutdown #14468

Closed
bhalevy opened this issue Jul 1, 2023 · 5 comments · Fixed by #14507

bhalevy commented Jul 1, 2023

Seen first in https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release/290/testReport/update_cluster_layout_tests/TestUpdateClusterLayout/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_decommission_after_changing_node_ip/

RuntimeError: The process is dead, returncode=1

https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release/290/artifact/logs-full.release.010/1688093180144_update_cluster_layout_tests.py%3A%3ATestUpdateClusterLayout%3A%3Atest_decommission_after_changing_node_ip/node4.log

Scylla version 5.4.0~dev-0.20230630.ff386e7a445d with build-id 228b9c341f6b5404947f4a2020d4a25eeaacebbe starting ...

INFO  2023-06-30 02:45:44,589 [shard 0] init - seeds={127.0.96.1, 127.0.96.2, 127.0.96.4, 127.0.96.33}, listen_address=127.0.96.4, broadcast_address=127.0.96.4

INFO  2023-06-30 02:45:44,675 [shard 0] storage_service - Starting up server gossip
INFO  2023-06-30 02:45:44,683 [shard 0] compaction - [Compact system.local 35def3b0-16f0-11ee-a3e4-02af61fe3371] Compacting [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-0iu1ozg8/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/mc-2-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-0iu1ozg8/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/mc-4-big-Data.db:level=0:origin=memtable]
INFO  2023-06-30 02:45:44,683 [shard 0] gossip - failure_detector_loop: Started main loop
INFO  2023-06-30 02:45:44,683 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-06-30 02:45:44,693 [shard 0] compaction - [Compact system.local 35def3b0-16f0-11ee-a3e4-02af61fe3371] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-0iu1ozg8/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/mc-6-big-Data.db:level=0]. 13kB to 7kB (~55% of original) in 7ms = 1MB/s. ~256 total partitions merged to 1.
INFO  2023-06-30 02:45:46,131 [shard 0] storage_service - Set host_id=97569d56-628c-4485-b0e6-1c68c40c537f to be owned by node=127.0.96.1
INFO  2023-06-30 02:45:46,132 [shard 0] gossip - InetAddress 127.0.96.33 is now DOWN, status = LEFT
WARN  2023-06-30 02:45:46,132 [shard 0] gossip - Fail to send EchoMessage to 127.0.96.3: seastar::rpc::closed_error (connection is closed)
INFO  2023-06-30 02:45:46,132 [shard 0] gossip - InetAddress 127.0.96.1 is now UP, status = NORMAL
INFO  2023-06-30 02:45:46,132 [shard 0] gossip - InetAddress 127.0.96.2 is now UP, status = NORMAL
INFO  2023-06-30 02:45:46,133 [shard 0] gossip - Node 127.0.96.33 will be removed from gossip at [2023-07-03 02:45:19]: (expire = 1688352319445716595, now = 1688093146132961662, diff = 259173 seconds)
INFO  2023-06-30 02:45:46,133 [shard 0] storage_service - Removing tokens {1957138825282075456, 538228788046155918, 1441663362887118471, 2009889732205232147, 1285921624202456720, 2299813546280369809, 1147922929401656711, 3437689283141886193, 1093939752358477342, -95578895009921325, -933537828167744695, -4822175528068997885, -9064814390348450199, -2838398123611190415, -8865697753194005691, -8641240172602522619, -8527192015627880935, -8455721658642346121, -1602885157652809629, -826999321362664251, -8004313790603409879, -7983704426324750105, 6745400184375170269, 1897403336345133737, -7785637751534354940, 6402350960772843634, -9131862161278302934, 7015699157901876878, -7507635517787229617, -7872263450865226137, -2398561151223758453, -2757858320304448776, -7135017416280759518, 762591318120088757, -7036247732884944684, -4706718406784561136, 772993406584974221, -6901710291266973174, -6873621148550053390, -131764165975166161, -6467486352483923794, -8723719628383537842, -7239257021703251755, -8876209735699572524, -6300796970024261941, -633462019235493849, -6310403927118869660, -1431497748581863464, 3697526867126198196, 4923690827301917877, 8112265936827023660, -6277863917912802240, -6623626227015086286, -6263567382473675505, 2044340043996898806, -1684759488608644447, 4219519608993832438, 622262989057717658, -6132029616733649227, -2863638381489490074, -6093224496148114428, -7168585208466197697, 82753073108884114, -5933252256733306442, -5452048513549809665, 3547905835448042745, -5895694438770948515, 7045113366380893923, -6640938570355561189, -5660282275830501698, -3728555925078188231, -4034527578391978588, -5233014222012667101, -8844096475636041617, -2839352313897520511, -917431164596666618, 3618597186644201391, 1638066742741535716, -5023041224884994968, -2616649082261222559, -2415888436872385330, -2940916223950425714, -8994438244973781987, -833074952334662081, -4426602942396513532, 649128337838919561, 7775225799600157744, 1580061385397170636, -2402418775007732133, -7427857065668640033, -1049799291451885300, -2550242859911733562, 5198876569107239321, -8301793316585325178, 5554285130267436886, -8058007130247916491, -4088885882961805256, -7332104044879353661, -1732367245764070833, -5214440851373603397, -4028381155067564419, 741098824345071851, -2076353771709165172, 2095974798297043788, 3679342026564125770, -637331905247054159, -42107836694803384, 6736375668058457916, -231760654102333982, 4133801035329363389, -129951755050233280, -1712843812334547699, -2812010428323279743, -8972822811092831737, -5934804369128314387, -1911244404117897093, 7399718564801560588, -426914257413052499, 1951602158941318813, -2310239841370092592, -5096146377327979133, -5353838904714672202, 7416808377702594272, -2762853363408437030, -1987318027109743947, -2698847702965464001, -2389636933915703465, -1642036105268827801, 729012831382377832, 4192467494245733578, 453328910274435383, 4604210925835301206, -4861600011407716479, -8866456080701519922, 4710686227407048955, 4060438571529584411, 8082128952710491799, -8917743902008465042, 4830388162512167990, 5405228356480841250, 3896618241085296806, 7483054753003347861, 4848712762403721656, 8155890873069566919, 6050126936624620059, 8143687424717096518, 5099342403467628764, 8193264499918054230, 8364320176004023673, 7464679094141006861, 2369583153774542736, 4661971083725460006, 6557277420237977594, 8907558311654534023, 4255593540650591036, 4697507652786315355, 8795941309877878551, 5544435872844200824, 994944297945037712, 8674029532381474478, 8938215158385744024, 9175241447091780067, 
8319752497800048455, 9098428809063992811, 7083019289871699050, 6392147032912275551, 959369937171193663, 775223285155485749, 6969946619945396669, -239385818321278898, 7867023678609882900, 8485756174775015178, 653524404411507543, 2712519592800561458, 784408449538585673, 7516305215489330783, 5284849260795346378, 4542935427815921456, 7491143762287671218, 5179321468636036249, -2958423916817125021, 8621454108457851488, 5343972147310513540, 7050409196988331169, 6241621443978259740, 5691844199512967388, 743728028477351888, 8410411973934585175, 6021480115957511646, -3856883873430924609, -7280028224051135209, 8185743236066043166, 4819273962897017308, 5059145346103373931, 5661813372786345482, 5599681698548794203, 6932747168268650376, 6400515060487972645, 785000777408566651, 8595379441425988714, 4145778381949139232, 3999909828137760869, 6405837028323423354, 3817823138014204958, 5053716988947964343, 2101315427911447873, -4806749523356533582, -132972332550087561, -5353424467192232738, -8802576457060150269, -3474097763156463305, 9009209169705780290, 7802793542092058234, 2662779670492735497, 3691097620139590768, 6513610463854070484, 6471493476659702419, 3650193360811123081, 3001254928584914018, 3499840302063501290, 6779972895102656116, 3131683082988308425, 3031653570599157564, 4892084020860236578, 2497511531750475769, 285759181479020292, 247802566240022100, 2284461744151074821, 8207089875905276412, -3411577389189401767, 2411677680261672874, 7324966747666123341, 2377271546849325331, 2319710447218706879, 2194167096131029618, 7258249247273455641, -5125548559804459378, -4212817902181982644, -4997823109883866192, -9119142511799409952, -3973719466666207470, -2050716514012087603, 5878268823371831765, -5038661372228412283, -6264926153255382601, 5237492490607863914, 9162069553344955750, -3338928958552776636, -4485975510199365998, -3289713503100616853, 3340398587603137395, 3269564000274247296, -3109033788701436529, 8147309304491324864, -4879276416205021178, -5513812679592304083} for 127.0.96.33
INFO  2023-06-30 02:45:46,133 [shard 0] gossip - removed 127.0.96.33 from _seeds, updated _seeds list = {127.0.96.1, 127.0.96.2}
INFO  2023-06-30 02:45:46,134 [shard 0] storage_service - Set host_id=bd2c8d8d-3751-4491-9b96-ac414b747fa7 to be owned by node=127.0.96.2
INFO  2023-06-30 02:45:46,136 [shard 0] storage_service - Skip to set host_id=ed379b1a-c563-44be-acf0-28ccd6ca7096 to be owned by node=127.0.96.3, because the node is removed from the cluster, nodes {127.0.96.3, 127.0.96.33} used to own the host_id
INFO  2023-06-30 02:45:46,137 [shard 0] gossip - InetAddress 127.0.96.3 is now DOWN, status = shutdown
INFO  2023-06-30 02:45:46,139 [shard 0] migration_manager - Requesting schema pull from 127.0.96.1:0
INFO  2023-06-30 02:45:46,139 [shard 0] migration_manager - Pulling schema from 127.0.96.1:0
INFO  2023-06-30 02:45:46,140 [shard 0] migration_manager - Requesting schema pull from 127.0.96.2:0
INFO  2023-06-30 02:45:46,140 [shard 0] migration_manager - Pulling schema from 127.0.96.2:0
INFO  2023-06-30 02:45:46,142 [shard 0] gossip - Live nodes seen in gossip: {127.0.96.1, 127.0.96.2, 127.0.96.4}
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.3, status=shutdown
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.33, status=LEFT
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.2, status=NORMAL
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.1, status=NORMAL
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Check node=127.0.96.4, status=UNKNOWN
INFO  2023-06-30 02:45:46,142 [shard 0] storage_service - Started waiting for normal state handler for nodes {127.0.96.1, 127.0.96.2, 127.0.96.3}

INFO  2023-06-30 02:46:16,245 [shard 0] init - Shutting down group 0 service
...
ERROR 2023-06-30 02:46:16,771 [shard 0] init - Startup failed: std::runtime_error (Failed to mark node as alive in 30000 ms, nodes={127.0.96.1, 127.0.96.2, 127.0.96.3}, live_nodes={127.0.96.1, 127.0.96.2})
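For context, the timeout above comes from a bounded liveness wait performed right after gossip starts. Below is a minimal standalone sketch of that failure mode in plain C++, using hypothetical simplified types rather than Scylla's actual gossiper API: if the wait set contains an address that will never be reported live (here node3's stale 127.0.96.3), the deadline expires and startup aborts.

```cpp
// Standalone illustration, not Scylla code: the types, poll interval, and the
// shortened 500 ms budget are assumptions; the real run used a 30000 ms budget.
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <thread>

using inet_address = std::string;

// Wait until every address in `nodes` appears in `get_live()`, or give up at `timeout`.
bool wait_alive(const std::set<inet_address>& nodes,
                const std::function<std::set<inet_address>()>& get_live,
                std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        const auto live = get_live();
        if (std::includes(live.begin(), live.end(), nodes.begin(), nodes.end())) {
            return true;  // every waited-for node is live
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    return false;
}

int main() {
    // node4's wait set wrongly includes node3's stale address 127.0.96.3,
    // which can never become live, so the wait is doomed to time out.
    const std::set<inet_address> nodes = {"127.0.96.1", "127.0.96.2", "127.0.96.3"};
    const auto get_live = [] {
        return std::set<inet_address>{"127.0.96.1", "127.0.96.2"};
    };
    if (!wait_alive(nodes, get_live, std::chrono::milliseconds(500))) {
        std::cout << "failed to mark all nodes alive; live={127.0.96.1, 127.0.96.2}\n";
    }
}
```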
@bhalevy bhalevy changed the title test_decommission_after_changing_node_ip: node4 fails to start when stale ip address of node3 appears as DOWN test_decommission_after_changing_node_ip: node4 fails to start when stale ip address of node3 appears as DOWN, status=shutdown Jul 1, 2023
bhalevy commented Jul 1, 2023

@kbr-scylla, I bisected this to this merge: 50e8ec7

@bhalevy bhalevy added the status/regression, area/repair-based operations (repair-based node operations), and triage/master (Looking for assignee) labels Jul 2, 2023
bhalevy commented Jul 2, 2023

The test passes with the following patch:

diff --git a/service/storage_service.cc b/service/storage_service.cc
index 1771800565..9e9fecc436 100644
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -2459,11 +2459,14 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
     if (tmptr->is_normal_token_owner(endpoint)) {
         slogger.info("Node {} state jump to normal", endpoint);
     }
-    std::unordered_set<inet_address> endpoints_to_remove;
+    std::unordered_map<inet_address, bool> endpoints_to_remove;
 
-    auto do_remove_node = [&] (gms::inet_address node) {
+    auto do_remove_node = [&] (gms::inet_address node, bool force = false) {
         tmptr->remove_endpoint(node);
-        endpoints_to_remove.insert(node);
+        auto [it, inserted] = endpoints_to_remove.try_emplace(node, force);
+        if (!inserted && force) {
+            it->second = force;
+        }
     };
     // Order Matters, TM.updateHostID() should be called before TM.updateNormalToken(), (see CASSANDRA-4300).
     if (_gossiper.uses_host_id(endpoint)) {
@@ -2475,7 +2478,7 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
                 do_remove_node(endpoint);
             } else if (_gossiper.compare_endpoint_startup(endpoint, *existing) > 0) {
                 slogger.warn("Host ID collision for {} between {} and {}; {} is the new owner", host_id, *existing, endpoint, endpoint);
-                do_remove_node(*existing);
+                do_remove_node(*existing, true);
                 slogger.info("Set host_id={} to be owned by node={}, existing={}", host_id, endpoint, *existing);
                 tmptr->update_host_id(host_id, endpoint);
             } else {
@@ -2557,7 +2560,7 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
 
     for (const auto& ep : candidates_for_removal) {
         slogger.info("handle_state_normal: endpoints_to_remove endpoint={}", ep);
-        endpoints_to_remove.insert(ep);
+        endpoints_to_remove.try_emplace(ep, false);
     }
 
     bool is_normal_token_owner = tmptr->is_normal_token_owner(endpoint);
@@ -2584,8 +2587,8 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
     co_await replicate_to_all_cores(std::move(tmptr));
     tmlock.reset();
 
-    for (auto ep : endpoints_to_remove) {
-        co_await remove_endpoint(ep);
+    for (const auto& [ep, force] : endpoints_to_remove) {
+        co_await remove_endpoint(ep, force);
     }
     slogger.debug("handle_state_normal: endpoint={} is_normal_token_owner={} endpoint_to_remove={} owned_tokens={}", endpoint, is_normal_token_owner, endpoints_to_remove.contains(endpoint), owned_tokens);
     if (!owned_tokens.empty() && !endpoints_to_remove.count(endpoint)) {
@@ -3208,8 +3211,12 @@ future<> storage_service::check_for_endpoint_collision(std::unordered_set<gms::i
     });
 }
 
-future<> storage_service::remove_endpoint(inet_address endpoint) {
-    co_await _gossiper.remove_endpoint(endpoint);
+future<> storage_service::remove_endpoint(inet_address endpoint, bool force) {
+    if (force) {
+        co_await _gossiper.force_remove_endpoint(endpoint);
+    } else {
+        co_await _gossiper.remove_endpoint(endpoint);
+    }
     try {
         co_await _sys_ks.local().remove_endpoint(endpoint);
     } catch (...) {
diff --git a/service/storage_service.hh b/service/storage_service.hh
index c0ba08ac67..abf49cb104 100644
--- a/service/storage_service.hh
+++ b/service/storage_service.hh
@@ -547,7 +547,7 @@ class storage_service : public service::migration_listener, public gms::i_endpoi
     future<> excise(std::unordered_set<token> tokens, inet_address endpoint, long expire_time);
 
     /** unlike excise we just need this endpoint gone without going through any notifications **/
-    future<> remove_endpoint(inet_address endpoint);
+    future<> remove_endpoint(inet_address endpoint, bool force = false);
 
     void add_expire_time_if_found(inet_address endpoint, int64_t expire_time);
 

bhalevy commented Jul 2, 2023

I'm testing this patch with the following dtest suites: update_cluster_layout_tests.py, replace_address_test.py, repair_based_node_operations_test.py.
Will submit if it passes cleanly.

bhalevy added a commit to bhalevy/scylla that referenced this issue Jul 2, 2023
When a host changes its IP address we should force remove
the previous endpoint state since we want only one endpoint
to refer to this host_id.

If the new node that changed its IP address is decommissioned,
the previous node appears as a normal token owner, just in shutdown
status, but it is no longer in the cluster.

Fixes scylladb#14468

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
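To illustrate the intent of the patch, here is a minimal standalone model (not Scylla's actual `handle_state_normal`/gossiper implementation; all types and names below are simplified assumptions): when an endpoint gossips NORMAL or shutdown for a host_id that is already mapped to a different IP, the stale IP must be dropped outright so only one endpoint ever refers to that host_id.

```cpp
// Hypothetical simplified model of the "force remove the previous endpoint
// state" idea; it is not the actual storage_service code.
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

using inet_address = std::string;
using host_id = std::string;

struct cluster_view {
    std::unordered_map<host_id, inet_address> owner_by_host_id;  // host_id -> current IP
    std::set<inet_address> normal_endpoints;                     // endpoints treated as token owners

    // Called when `endpoint` announces state NORMAL (or shutdown) claiming `id`.
    void handle_state_normal(const inet_address& endpoint, const host_id& id) {
        auto it = owner_by_host_id.find(id);
        if (it != owner_by_host_id.end() && it->second != endpoint) {
            // Host ID collision: the same host_id used to be owned by another IP.
            // The newer endpoint wins and the old entry is force-removed, so it
            // cannot linger as a NORMAL/shutdown token owner.
            std::cout << "host_id " << id << " moved " << it->second << " -> "
                      << endpoint << ", force-removing stale endpoint\n";
            normal_endpoints.erase(it->second);
        }
        owner_by_host_id[id] = endpoint;
        normal_endpoints.insert(endpoint);
    }
};

int main() {
    cluster_view view;
    view.handle_state_normal("127.0.96.3", "ed379b1a");   // node3 at its original IP
    view.handle_state_normal("127.0.96.33", "ed379b1a");  // node3 after changing its IP
    // Only the new IP remains; the stale 127.0.96.3 no longer counts as an owner.
    for (const auto& ep : view.normal_endpoints) {
        std::cout << "normal endpoint: " << ep << '\n';
    }
}
```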
@bhalevy bhalevy added this to the 5.4 milestone Jul 3, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
storage_service: don't calculate `ignore_nodes` before waiting for normal handlers

Before this commit the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec7).

Unfortunately, as always with gossiper, there are complications.
In scylladb#14468 and scylladb#14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

The `parse_node_list` problem is solved in this commit. It turns out
that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`)
in order to wait for NORMAL state handlers. We can wait for handlers to
finish for *any* `NORMAL`/`shutdown` entries appearing in gossiper, even
those that correspond to dead/ignored nodes and obsolete IPs.
`handle_state_normal` is called, and eventually finishes, for all of
them. `wait_for_normal_state_handled_on_boot` no longer receives a set
of nodes as a parameter and is modified appropriately; it now
calculates the necessary set of nodes on each retry (the set may shrink
while we're waiting, e.g. because an entry corresponding to a node that
was replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in scylladb#14487, but the test
will still fail. That's because `_gossiper.wait_alive` may still receive
too large a set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set. We fix this
in the following commit which will solve both issues.
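A minimal standalone sketch of the retry-loop idea described above, assuming hypothetical simplified types and an arbitrary poll interval (it is not the actual coroutine-based `wait_for_normal_state_handled_on_boot`): the pending set is recomputed on every iteration, so an obsolete entry that gets garbage-collected from gossiper state simply drops out of the wait.

```cpp
// Standalone model only; all names and timings here are assumptions.
#include <chrono>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>

using inet_address = std::string;

struct gossip_snapshot {
    std::set<inet_address> normal_or_shutdown;  // entries currently present in gossiper state
    std::set<inet_address> handled;             // entries whose NORMAL handler has finished
};

// Recompute the pending set on each retry instead of waiting on a fixed node list.
template <typename Poll>
bool wait_for_normal_state_handled(Poll poll, std::chrono::seconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        const gossip_snapshot snap = poll();
        std::set<inet_address> pending;
        for (const auto& ep : snap.normal_or_shutdown) {
            if (!snap.handled.count(ep)) {
                pending.insert(ep);
            }
        }
        if (pending.empty()) {
            return true;  // every entry still present has been handled
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            std::cerr << "timed out with " << pending.size() << " entries pending\n";
            return false;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

int main() {
    std::mutex m;
    gossip_snapshot state{{"127.0.96.1", "127.0.96.2", "127.0.96.3"}, {"127.0.96.1"}};
    // Simulate handlers finishing and a stale entry being garbage-collected.
    std::thread progress([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(300));
        std::lock_guard<std::mutex> guard(m);
        state.handled.insert("127.0.96.2");
        state.normal_or_shutdown.erase("127.0.96.3");  // obsolete IP disappears
    });
    const bool ok = wait_for_normal_state_handled([&] {
        std::lock_guard<std::mutex> guard(m);
        return state;  // copy taken under the lock
    }, std::chrono::seconds(5));
    progress.join();
    std::cout << (ok ? "all NORMAL/shutdown entries handled\n" : "timed out\n");
}
```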
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jul 4, 2023
storage_service: use `token_metadata` to calculate nodes waited for to be UP

At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.

In scylladb#14468 and scylladb#14487 we observed that this set may include entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

Fixes scylladb#14468
Fixes scylladb#14487
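A standalone sketch of the approach described above, with simplified types standing in for `token_metadata` (this is not the real helper from ca61d88; all names here are assumptions): because the map is keyed by Host ID and IP collisions were resolved when it was populated, neither the stale IP nor the decommissioned node can end up in the wait set.

```cpp
// Hypothetical simplified model; not Scylla's API.
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

using inet_address = std::string;
using host_id = std::string;

// Derive the nodes to wait for from a Host-ID-keyed owner map (one entry per
// current cluster member), excluding ourselves and any explicitly ignored nodes.
std::set<inet_address> nodes_to_wait_for(
        const std::unordered_map<host_id, inet_address>& normal_token_owners,
        const std::set<inet_address>& ignore_nodes,
        const inet_address& my_address) {
    std::set<inet_address> result;
    for (const auto& [id, ep] : normal_token_owners) {
        if (ep != my_address && !ignore_nodes.count(ep)) {
            result.insert(ep);
        }
    }
    return result;
}

int main() {
    // Only the real current members appear: 127.0.96.3 was superseded by the IP
    // change and 127.0.96.33 left the ring, so neither is listed here.
    const std::unordered_map<host_id, inet_address> normal_token_owners = {
        {"97569d56", "127.0.96.1"},
        {"bd2c8d8d", "127.0.96.2"},
    };
    for (const auto& ep : nodes_to_wait_for(normal_token_owners, /*ignore_nodes=*/{}, "127.0.96.4")) {
        std::cout << "wait for " << ep << " to be UP\n";
    }
}
```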
@DoronArazii DoronArazii removed the triage/master (Looking for assignee) label Jul 9, 2023
tgrabiec added a commit that referenced this issue Jul 9, 2023
…es, recently replaced nodes, and recently changed IPs' from Kamil Braun

Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec7).

Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs.  `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as a parameter and is modified appropriately; it now calculates
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
too large a set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

We also introduce regression tests with necessary extensions
to the test framework.

Fixes #14468
Fixes #14487

Closes #14507

* github.com:scylladb/scylladb:
  test: rename `test_topology_ip.py` to `test_replace.py`
  test: test bootstrap after IP change
  test: scylla_cluster: return the new IP from `change_ip` API
  test: node replace with `ignore_dead_nodes` test
  test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
  storage_service: remove `get_nodes_to_sync_with`
  storage_service: use `token_metadata` to calculate nodes waited for to be UP
  storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
bhalevy added a commit to bhalevy/scylla that referenced this issue Aug 6, 2023
When a host changes its IP address we should force remove
the previous endpoint state since we want only one endpoint
to refer to this host_id.

If the new node that changed its IP address is decommissioned,
the previous node appears as a normal token owner, just in shutdown
status, but it is no longer in the cluster.

Refs scylladb#14468
Fixes scylladb#13775

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
denesb commented Dec 15, 2023

@bhalevy @kbr-scylla do we need to backport this to 5.2? If yes, please open a backport PR.

kbr-scylla commented

This issue was caused by 50e8ec7, which did not end up in 5.2, so no.
