
Use topology nodes for repair and node operations #12506

Draft
wants to merge 32 commits into base: master

Conversation

bhalevy
Member

@bhalevy bhalevy commented Jan 12, 2023

This series depends on #11987, which introduces class node to topology.
It then uses shared node pointers to refer to nodes in node operations instead of their endpoints,
in particular to rely on the always-unique node host_id rather than on the endpoint, which may change and may not be unique during a node replace.

@bhalevy
Member Author

bhalevy commented Jan 12, 2023

converted to draft until #11987 is merged

@scylladb-promoter
Contributor

@bhalevy
Member Author

bhalevy commented Jan 13, 2023

Depends also on #12316

@bhalevy bhalevy force-pushed the topology-nodes branch 2 times, most recently from bf78dc7 to 2feb46c on January 18, 2023 09:47
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and get rid of the ad-hoc implementation in network_topology_strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the thread_local default_location out of
topology::get_location so it can be used elsewhere.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
storage_service::replicate_to_all_cores has a sophisticated way
to mutate the token_metadata and effective_replication_map
on shard 0 and clone them to all other shards, applying
the changes only if both mutate and clone succeeded on all shards,
so we don't end up with only some of the shards holding the mutated
copy if an error happened mid-way (and then we would need to
roll back the change for exception safety).

shared_token_metadata::mutate_token_metadata is currently only called from
a unit test that needs to mutate the token metadata only on shard 0,
but a following patch will require doing that on all shards.

This change adds this capability by enforcing that the call is made
on shard 0, mutating the token_metadata into a temporary pending copy
and cloning it on all other shards.  Only then, when all shards
have succeeded, is the modified token_metadata set on all shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
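
For illustration, a minimal sketch of the two-phase pattern described above, using hypothetical Shard/TokenMetadata stand-ins rather than the actual ScyllaDB types: mutate a pending copy, clone it for every shard, and only then publish it everywhere.

// Minimal sketch of the two-phase mutate/clone/publish pattern; hypothetical
// types, not the actual ScyllaDB code.
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

struct TokenMetadata { /* tokens, host ids, ... */ };

struct Shard {
    std::shared_ptr<const TokenMetadata> published;  // what readers on this shard see
};

void mutate_on_all_shards(std::vector<Shard>& shards,
                          const std::function<void(TokenMetadata&)>& mutate) {
    // Phase 1: mutate a pending copy on "shard 0" and clone it for every shard.
    // If anything throws here, nothing has been published yet, so there is
    // nothing to roll back.
    auto pending = std::make_shared<TokenMetadata>(*shards.at(0).published);
    mutate(*pending);

    std::vector<std::shared_ptr<const TokenMetadata>> clones;
    clones.reserve(shards.size());
    for (std::size_t i = 0; i < shards.size(); ++i) {
        clones.push_back(std::make_shared<const TokenMetadata>(*pending));
    }

    // Phase 2: only after every clone succeeded, publish the new copy on every
    // shard. This part is just pointer swaps and cannot fail half-way.
    for (std::size_t i = 0; i < shards.size(); ++i) {
        shards[i].published = std::move(clones[i]);
    }
}
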
And keep per-node information (host_id, endpoint, dc_rack, is_pending)
in node objects, indexed by topology on several indices:
host_id, endpoint, current/pending, per dc, per dc/rack.

Refs scylladb#6403

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
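
A rough sketch of the described indexing, with hypothetical member and type names (the real node and topology classes live under locator/):

// Hypothetical sketch only; real field types are locator::host_id,
// gms::inet_address, etc.
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>

struct node_sketch {
    std::uint64_t host_id;   // stands in for locator::host_id
    std::uint32_t endpoint;  // stands in for gms::inet_address
    std::string dc;
    std::string rack;
    bool is_pending;
};
using node_ptr_sketch = std::shared_ptr<const node_sketch>;

struct topology_sketch {
    std::unordered_map<std::uint64_t, node_ptr_sketch> nodes_by_host_id;
    std::unordered_map<std::uint32_t, node_ptr_sketch> nodes_by_endpoint;
    std::unordered_set<node_ptr_sketch> current_nodes;
    std::unordered_set<node_ptr_sketch> pending_nodes;
    // dc -> nodes, and "dc/rack" -> nodes
    std::unordered_map<std::string, std::unordered_set<node_ptr_sketch>> nodes_by_dc;
    std::unordered_map<std::string, std::unordered_set<node_ptr_sketch>> nodes_by_dc_rack;
};
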
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The node index is a shorthand identifier for the node,
to be used for mapping nodes in token metadata
and node operations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a simple node state model with
joining, normal, leaving, and left states
to help manage nodes during replace
with the same ip address.

Later on, this could also help prevent nodes
that were decommissioned, removed, or replaced
from rejoining the cluster.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
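
A minimal sketch of such a state model; only the four states come from the patch, the helper name and rejection rule below are hypothetical:

// The four states come from the patch; may_rejoin() is a hypothetical helper
// illustrating how "left" nodes could be prevented from rejoining.
enum class node_state { joining, normal, leaving, left };

inline bool may_rejoin(node_state s) {
    // A node that was decommissioned, removed, or replaced ends up in the
    // left state and must not be allowed back into the cluster.
    return s != node_state::left;
}
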
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use netw::msg_addr directly instead, so fmt::formatter
would know to use std::operator<<(..., const netw::msg_addr&),
as it gets confused by type aliasing.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can associate the inet_address with a unique host_id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…quest

To be used by storage_service to make node_ops commands
based on node_ptrs rather than endpoints.

Prepare for v2, where endpoints will encode the node index
as a raw ip address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the local node and track
other nodes to sync with all the way to
node_ops_cmd_heartbeat_updater and send_node_ops_cmd.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the local node and track
other nodes to sync with all the way to
node_ops_cmd_heartbeat_updater and send_node_ops_cmd.

Also, base the list of nodes to sync with
on the token_metadata get_endpoint_to_host_id_map_for_reading
that lists all normal token owners, rather than on the nodes
in gossip, as the token_metadata is the source of truth
for node operation as in all other ops.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use std::variant<host_id, inet_address> to keep
either of them and resolve using topology
to a node_ptr.

Reflect that also from token_metadata::parse_host_id_and_endpoint
that gets a string describing either a host_id or inet_address
and returns a node_ptr if found.

Exceptions are thrown if the string is invalid
or if the node is not found in topology.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
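
A minimal sketch of that resolution, with stand-in types; the real code uses locator::host_id, gms::inet_address and locator::node_ptr:

// Hypothetical stand-in types; find_node() bodies are stubs for the sketch.
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <variant>

struct host_id { std::uint64_t value; };
struct inet_address { std::uint32_t value; };
struct node {};
using node_ptr = std::shared_ptr<const node>;

struct topology {
    // Stubs; the real lookups consult the topology indices and return
    // nullptr (or throw) when the node is unknown.
    node_ptr find_node(host_id) const { return nullptr; }
    node_ptr find_node(inet_address) const { return nullptr; }
};

using host_id_or_endpoint = std::variant<host_id, inet_address>;

node_ptr resolve(const topology& topo, const host_id_or_endpoint& id) {
    auto n = std::visit([&] (const auto& v) { return topo.find_node(v); }, id);
    if (!n) {
        throw std::runtime_error("node not found in topology");
    }
    return n;
}
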
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to track the nodes
to sync with all the way to
node_ops_cmd_heartbeat_updater and send_node_ops_cmd.

Also, base the list of nodes to sync with
on the token_metadata get_endpoint_to_host_id_map_for_reading
that lists all normal token owners, rather than on the nodes
in gossip, as the token_metadata is the source of truth
for node operation as in all other ops.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the leaving node and track
other nodes to sync with all the way to
node_ops_cmd_heartbeat_updater and send_node_ops_cmd.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With node_ops_cmd_v2, use the node index as the
inet_address in the existing node_ops_cmd_request fields
and send an additional dictionary mapping each node index
to a pair of <host_id, inet_address>.

Provide a set of getters that either extract the legacy
field from the request if the dictionary isn't provided,
or use the dictionary to translate the raw addresses
that contain node indexes into the real node addresses.

A later patch will use the host_id in the dictionary
to look up nodes in storage_service::node_ops_cmd_handler
and both the host_id and inet_address to populate
new nodes in topology for the replace and bootstrap cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
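
A minimal sketch of the v2 lookup described above, with stand-in types and a single leaving_node field for brevity; the real request uses gms::inet_address fields and a locator::host_id_and_endpoint dictionary:

// Hypothetical sketch; field names and scalar types are stand-ins.
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

struct host_id_and_endpoint {
    std::uint64_t host_id;
    std::uint32_t endpoint;  // real address
};

struct node_ops_cmd_request_sketch {
    // Legacy field: a real address in v1; in v2 it carries the node index
    // encoded as a raw IPv4 value.
    std::uint32_t leaving_node;
    // v2 only: node index -> (host_id, endpoint). Empty for v1 senders.
    std::unordered_map<std::uint32_t, host_id_and_endpoint> nodes_dict;

    std::uint32_t get_leaving_endpoint() const {
        if (nodes_dict.empty()) {
            return leaving_node;                  // v1: the field is the address
        }
        auto it = nodes_dict.find(leaving_node);  // v2: the field is an index
        if (it == nodes_dict.end()) {
            throw std::runtime_error("node index not found in nodes_dict");
        }
        return it->second.endpoint;
    }
};
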
Add helpers that print a warning or error to the log
and then throw an exception with that message, as this
is a common pattern we use.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
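
A minimal sketch of such a helper; the function name and the use of stderr in place of the real logger are placeholders:

// Hypothetical log-and-throw helper; the real helpers use the scylla logger.
#include <cstdio>
#include <stdexcept>
#include <string>

[[noreturn]] inline void log_and_throw(const std::string& msg) {
    std::fprintf(stderr, "ERROR: %s\n", msg.c_str());  // stands in for slogger.error(msg)
    throw std::runtime_error(msg);
}
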
As a first step, find each node to ignore given
its endpoint or host_id, if available.

At this stage, translate those back into a list
of endpoints when passing to node_ops_meta_data.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…s_info

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to pass the tokens by value, since
add_bootstrap_token(s) and remove_bootstrap_tokens
are synchronous.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
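
A minimal illustration with hypothetical signatures: since the callee is synchronous and does not keep the tokens alive past the call, it can borrow them by const reference instead of copying the whole set.

#include <string>
#include <unordered_set>

using token = std::string;  // stand-in for the real token type

// Before (hypothetical): void add_bootstrap_tokens(std::unordered_set<token> tokens);  // copies the set
// After: borrow the caller's set for the duration of the synchronous call.
inline void add_bootstrap_tokens(const std::unordered_set<token>& tokens) {
    // ... insert the tokens into the bootstrap set, synchronously ...
    (void)tokens;
}
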
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@bhalevy bhalevy requested review from denesb and removed request for tgrabiec and nyh February 21, 2023 13:49
@@ -34,6 +34,7 @@ future<> one_test(const std::string& property_fname, bool exp_result) {

utils::fb_utilities::set_broadcast_address(gms::inet_address("localhost"));
utils::fb_utilities::set_broadcast_rpc_address(gms::inet_address("localhost"));
utils::fb_utilities::set_host_id(locator::host_id::create_random_id());
Contributor

We should really find a better place for the local address than a function-local global. For one, it is really hard to obtain in gdb.

Maybe not in this PR.

Member Author

I agree. That could be done as a follow-up to #11987.
Maybe topology itself would actually be the best place to keep the local address and host_id, on the local node.

@@ -19,7 +19,6 @@ using namespace netw;
namespace api {

using shard_info = messaging_service::shard_info;
using msg_addr = messaging_service::msg_addr;
Contributor

This (and the changes to accommodate it) looks like it should be part of the previous commit.

Member Author

okay, will look into it for next version


namespace netw {

struct msg_addr {
gms::inet_address addr;
uint32_t cpu_id;
locator::host_id host_id;
Contributor

I'm not sure about this.
Is this a temporary measure to bridge us to the time when class node is used everywhere?
This effectively limits msg_addr to inter-node RPC usage. Maybe that is fine, it is in the messaging/ "module" after all.

Contributor

Actually, I like how natural this makes it to pass nodes to messaging layer.

auto req = node_ops_cmd_request(cmd, uuid, std::move(tables));

auto get_endpoint = [] (const locator::node_ptr& node) {
// Prep for v2
Contributor

v2 of what?

Contributor

I now know that it refers to v2 of the cmd request verb structure, but at this point in the PR it was not at all obvious.

Member Author

okay, will clarify


auto start_time = std::chrono::steady_clock::now();
for (;;) {
sync_nodes.clear();
// Step 1: Decide who needs to sync data for bootstrap operation
for (const auto& x :_gossiper.get_endpoint_states()) {
for (const auto& [ep, host_id] : tmptr->get_endpoint_to_host_id_map_for_reading()) {
Contributor

If this change is a correctness issue, it might be worth extracting it into a separate commit, in case we can trace some bug to it.

Member Author

you're absolutely correct.
It's #13066

sync_nodes.push_back(node);
auto node = topo.find_node(host_id, locator::topology::must_exist::yes);
slogger.info("bootstrap[{}]: Check node={}, status={}", uuid, node, _gossiper.get_gossip_status(node->endpoint()));
if (node != local_node && !ignore_nodes.contains(node)) {
Contributor

node != local_node could be just removed here, and then we'd not need to explicitly add local node below.

Member Author

right, it should be done in a separate change though, as this is just a translation of the current logic to use node ptrs.
#12799 and this series turn sync_nodes into an unordered_set, so the insertion order doesn't matter.
Sending is done in parallel anyhow, so I see no point in special-casing the local node.

}

gms::inet_address endpoint() const {
return std::get<gms::inet_address>(data);
Contributor

Was the host_id/endpoint exclusive also before?

Member Author

it didn't have to be, but in case both were set, we should have preferred the host_id.

Member Author

now, for having both, we have host_id_and_endpoint defined below, which is a simple struct containing both members.

// v2:
// Optional field, maps node index to (host_id, inet_address)
// gms::inet_address values above will contain the node index as a raw ipv4 address
std::unordered_map<locator::node::idx_type, locator::host_id_and_endpoint> nodes_dict;
Contributor

How can we send node indexes across nodes? The remote might have a completely different node index <-> node mapping.
This seems quite hacky to me. I think we should just use a completely new verb and structure in v2.

Contributor

Later I understood that the node indexes are indexes into this map, which is also transferred on the wire. So that is fine. Still, I think this is quite hacky.

Member Author

yeah, it's a way to extend the RPC in a backward-compatible way without requiring a new set of verbs and handlers.

.rack = (dc && rack) ? rack->value : locator::endpoint_dc_rack::default_location.rack,
};
return topo.add_node(haep.host_id, haep.endpoint, dc_rack, state);
}
Contributor

I'm confused: isn't topology the definitive source of node information? Why are we getting node data from gossip and adding it to topology?

Member Author

We learn about new nodes from node operations, which is where this function is called from.
Gossip is only used here to get additional node attributes like dc and rack.
