
Use topology nodes for repair and node operations #12506

Draft
wants to merge 32 commits into base: master

Conversation

bhalevy
Member

@bhalevy bhalevy commented Jan 12, 2023

This series depends on #11987, which introduces class node to topology.
It then uses shared node pointers to refer to nodes in node operations instead of their endpoints,
in particular to rely on the always-unique node host_id rather than on the endpoint, which may change and may not be unique during a node replace.

@bhalevy
Member Author

bhalevy commented Jan 12, 2023

converted to draft until #11987 is merged

@scylladb-promoter
Contributor

@bhalevy
Member Author

bhalevy commented Jan 13, 2023

Depends also on #12316

@bhalevy bhalevy force-pushed the topology-nodes branch 2 times, most recently from bf78dc7 to 2feb46c on January 18, 2023 09:47
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and get rid of the ad-hoc implementation in network_topology_strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the thread_local default_location out of
topology::get_location so it can be used elsewhere.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
storage_service::replicate_to_all_cores has a sophisticated way
to mutate the token_metadata and effective_replication_map
on shard 0 and clone them to all other shards, applying
the changes only if both mutate and clone succeeded on all shards,
so we don't end up with only some of the shards holding the mutated
copy if an error happened mid-way (and then we would need to
roll back the change for exception safety).

shared_token_metadata::mutate_token_metadata is currently only called from
a unit test that needs to mutate the token metadata only on shard 0,
but a following patch will require doing that on all shards.

This change adds this capability by enforcing that the call is made
on shard 0, mutating the token_metadata into a temporary pending copy
and cloning it on all other shards.  Only then, when all shards
have succeeded, is the modified token_metadata set on all shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
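
For illustration, a minimal sketch of the two-phase pattern described above, using hypothetical Shard/TokenMetadata stand-ins rather than the actual ScyllaDB types: mutate a pending copy, clone it for every shard, and only then publish it everywhere.

// Minimal sketch of the two-phase mutate/clone/publish pattern; hypothetical
// types, not the actual ScyllaDB code.
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

struct TokenMetadata { /* tokens, host ids, ... */ };

struct Shard {
    std::shared_ptr<const TokenMetadata> published;  // what readers on this shard see
};

void mutate_on_all_shards(std::vector<Shard>& shards,
                          const std::function<void(TokenMetadata&)>& mutate) {
    // Phase 1: mutate a pending copy on "shard 0" and clone it for every shard.
    // If anything throws here, nothing has been published yet, so there is
    // nothing to roll back.
    auto pending = std::make_shared<TokenMetadata>(*shards.at(0).published);
    mutate(*pending);

    std::vector<std::shared_ptr<const TokenMetadata>> clones;
    clones.reserve(shards.size());
    for (std::size_t i = 0; i < shards.size(); ++i) {
        clones.push_back(std::make_shared<const TokenMetadata>(*pending));
    }

    // Phase 2: only after every clone succeeded, publish the new copy on every
    // shard. This part is just pointer swaps and cannot fail half-way.
    for (std::size_t i = 0; i < shards.size(); ++i) {
        shards[i].published = std::move(clones[i]);
    }
}
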
And keep per-node information (host_id, endpoint, dc_rack, is_pending)
in node objects, indexed by topology on several indices:
host_id, endpoint, current/pending, per dc, per dc/rack.

Refs scylladb#6403

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
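
A rough sketch of the described indexing, with hypothetical member and type names (the real node and topology classes live under locator/):

// Hypothetical sketch only; real field types are locator::host_id,
// gms::inet_address, etc.
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>

struct node_sketch {
    std::uint64_t host_id;   // stands in for locator::host_id
    std::uint32_t endpoint;  // stands in for gms::inet_address
    std::string dc;
    std::string rack;
    bool is_pending;
};
using node_ptr_sketch = std::shared_ptr<const node_sketch>;

struct topology_sketch {
    std::unordered_map<std::uint64_t, node_ptr_sketch> nodes_by_host_id;
    std::unordered_map<std::uint32_t, node_ptr_sketch> nodes_by_endpoint;
    std::unordered_set<node_ptr_sketch> current_nodes;
    std::unordered_set<node_ptr_sketch> pending_nodes;
    // dc -> nodes, and "dc/rack" -> nodes
    std::unordered_map<std::string, std::unordered_set<node_ptr_sketch>> nodes_by_dc;
    std::unordered_map<std::string, std::unordered_set<node_ptr_sketch>> nodes_by_dc_rack;
};
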
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The node index is a shorthand identifier for the node,
to be used for mapping nodes in token metadata
and node operations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a simple node state model with
joining, normal, leaving, and left states
to help manage nodes during replace
with the same ip address.

Later on, this could also help prevent nodes
that were decommissioned, removed, or replaced
from rejoining the cluster.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
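
A minimal sketch of such a state model; only the four states come from the patch, the helper name and rejection rule below are hypothetical:

// The four states come from the patch; may_rejoin() is a hypothetical helper
// illustrating how "left" nodes could be prevented from rejoining.
enum class node_state { joining, normal, leaving, left };

inline bool may_rejoin(node_state s) {
    // A node that was decommissioned, removed, or replaced ends up in the
    // left state and must not be allowed back into the cluster.
    return s != node_state::left;
}
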
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use netw::msg_addr directly instead, so fmt::formatter
would know to use std::operator<<(..., const netw::msg_addr&),
as it gets confused by type aliasing.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can associate the inet_address with a unique host_id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…quest

To be used by storage_service to make node_ops commands
based on node_ptrs rather than endpoints.

Prepare for v2, where endpoints will encode the node index
as a raw ip address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the local node and track
other nodes to sync with all the way to
node_ops_cmd_heartbeat_updater and send_node_ops_cmd.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the local node and track
other nodes to sync with all the way to
node_ops_cmd_heartbeat_updater and send_node_ops_cmd.

Also, base the list of nodes to sync with
on the token_metadata get_endpoint_to_host_id_map_for_reading
that lists all normal token owners, rather than on the nodes
in gossip, as the token_metadata is the source of truth
for node operation as in all other ops.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use std::variant<host_id, inet_address> to keep
either of them and resolve using topology
to a node_ptr.

Reflect that also from token_metadata::parse_host_id_and_endpoint
that gets a string describing either a host_id or inet_address
and returns a node_ptr if found.

Exceptions are thrown if the string is invalid
or if the node is not found in topology.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
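
A minimal sketch of that resolution, with stand-in types; the real code uses locator::host_id, gms::inet_address and locator::node_ptr:

// Hypothetical stand-in types; find_node() bodies are stubs for the sketch.
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <variant>

struct host_id { std::uint64_t value; };
struct inet_address { std::uint32_t value; };
struct node {};
using node_ptr = std::shared_ptr<const node>;

struct topology {
    // Stubs; the real lookups consult the topology indices and return
    // nullptr (or throw) when the node is unknown.
    node_ptr find_node(host_id) const { return nullptr; }
    node_ptr find_node(inet_address) const { return nullptr; }
};

using host_id_or_endpoint = std::variant<host_id, inet_address>;

node_ptr resolve(const topology& topo, const host_id_or_endpoint& id) {
    auto n = std::visit([&] (const auto& v) { return topo.find_node(v); }, id);
    if (!n) {
        throw std::runtime_error("node not found in topology");
    }
    return n;
}
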
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to track the nodes
to sync with all the way to
node_ops_cmd_heartbeat_updater and send_node_ops_cmd.

Also, base the list of nodes to sync with
on the token_metadata get_endpoint_to_host_id_map_for_reading
that lists all normal token owners, rather than on the nodes
in gossip, as the token_metadata is the source of truth
for node operation as in all other ops.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the leaving node and track
other nodes to sync with all the way to
node_ops_cmd_heartbeat_updater and send_node_ops_cmd.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With node_ops_cmd_v2, use the node index as the
inet_address in the existing node_ops_cmd_request fields
and send an additional dictionary mapping each node index
to a pair of <host_id, inet_address>.

Provide a set of getters that either extract the legacy
field from the request if the dictionary isn't provided,
or use the dictionary to translate the raw addresses
that contain node indexes into the real node addresses.

A later patch will use the host_id in the dictionary
to look up nodes in storage_service::node_ops_cmd_handler
and both the host_id and inet_address to populate
new nodes in topology for the replace and bootstrap cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
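
A minimal sketch of the v2 lookup described above, with stand-in types and a single leaving_node field for brevity; the real request uses gms::inet_address fields and a locator::host_id_and_endpoint dictionary:

// Hypothetical sketch; field names and scalar types are stand-ins.
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

struct host_id_and_endpoint {
    std::uint64_t host_id;
    std::uint32_t endpoint;  // real address
};

struct node_ops_cmd_request_sketch {
    // Legacy field: a real address in v1; in v2 it carries the node index
    // encoded as a raw IPv4 value.
    std::uint32_t leaving_node;
    // v2 only: node index -> (host_id, endpoint). Empty for v1 senders.
    std::unordered_map<std::uint32_t, host_id_and_endpoint> nodes_dict;

    std::uint32_t get_leaving_endpoint() const {
        if (nodes_dict.empty()) {
            return leaving_node;                  // v1: the field is the address
        }
        auto it = nodes_dict.find(leaving_node);  // v2: the field is an index
        if (it == nodes_dict.end()) {
            throw std::runtime_error("node index not found in nodes_dict");
        }
        return it->second.endpoint;
    }
};
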
Add helpers that print a warning or error to the log
and then throw an exception with that message, as this
is a common pattern we use.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
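
A minimal sketch of such a helper; the function name and the use of stderr in place of the real logger are placeholders:

// Hypothetical log-and-throw helper; the real helpers use the scylla logger.
#include <cstdio>
#include <stdexcept>
#include <string>

[[noreturn]] inline void log_and_throw(const std::string& msg) {
    std::fprintf(stderr, "ERROR: %s\n", msg.c_str());  // stands in for slogger.error(msg)
    throw std::runtime_error(msg);
}
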
As a first step, find each node to ignore given
its endpoint or host_id, if available.

At this stage, translate those back into a list
of endpoints when passing to node_ops_meta_data.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…s_info

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to pass the tokens by value, since
add_bootstrap_token(s) and remove_bootstrap_tokens
are synchronous.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
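
A minimal illustration with hypothetical signatures: since the callee is synchronous and does not keep the tokens alive past the call, it can borrow them by const reference instead of copying the whole set.

#include <string>
#include <unordered_set>

using token = std::string;  // stand-in for the real token type

// Before (hypothetical): void add_bootstrap_tokens(std::unordered_set<token> tokens);  // copies the set
// After: borrow the caller's set for the duration of the synchronous call.
inline void add_bootstrap_tokens(const std::unordered_set<token>& tokens) {
    // ... insert the tokens into the bootstrap set, synchronously ...
    (void)tokens;
}
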
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@bhalevy bhalevy requested review from denesb and removed request for tgrabiec and nyh February 21, 2023 13:49
@@ -34,6 +34,7 @@ future<> one_test(const std::string& property_fname, bool exp_result) {

utils::fb_utilities::set_broadcast_address(gms::inet_address("localhost"));
utils::fb_utilities::set_broadcast_rpc_address(gms::inet_address("localhost"));
utils::fb_utilities::set_host_id(locator::host_id::create_random_id());
Contributor

We should really find a better place for the local address than a function-local global. For one, it is really hard to obtain in gdb.

Maybe not in this PR.

Member Author

I agree. That could be done as a follow-up to #11987.
Maybe topology itself would actually be the best place to keep the local address and host_id, on the local node.

@@ -19,7 +19,6 @@ using namespace netw;
namespace api {

using shard_info = messaging_service::shard_info;
using msg_addr = messaging_service::msg_addr;
Contributor

This (and the changes to accommodate it) looks like it should be part of the previous commit.

Member Author

okay, will look into it for next version


namespace netw {

struct msg_addr {
gms::inet_address addr;
uint32_t cpu_id;
locator::host_id host_id;
Contributor

I'm not sure about this.
Is this a temporary measure to bridge us to the time when class node is used everywhere?
This effectively limits msg_addr to inter-node RPC usage. Maybe that is fine, it is in the messaging/ "module" after all.

Contributor

Actually, I like how natural this makes it to pass nodes to messaging layer.

auto req = node_ops_cmd_request(cmd, uuid, std::move(tables));

auto get_endpoint = [] (const locator::node_ptr& node) {
// Prep for v2
Contributor

v2 of what?

Contributor

I now know that it refers to v2 of the cmd request verb structure, but at this point in the PR it was not at all obvious.

Member Author

okay, will clarify


auto start_time = std::chrono::steady_clock::now();
for (;;) {
sync_nodes.clear();
// Step 1: Decide who needs to sync data for bootstrap operation
for (const auto& x :_gossiper.get_endpoint_states()) {
for (const auto& [ep, host_id] : tmptr->get_endpoint_to_host_id_map_for_reading()) {
Contributor

If this change is a correctness issue, it might be worth extracting it into a separate commit, in case we can trace some bug to it.

Member Author

you're absolutely correct.
It's #13066

sync_nodes.push_back(node);
auto node = topo.find_node(host_id, locator::topology::must_exist::yes);
slogger.info("bootstrap[{}]: Check node={}, status={}", uuid, node, _gossiper.get_gossip_status(node->endpoint()));
if (node != local_node && !ignore_nodes.contains(node)) {
Contributor

node != local_node could be just removed here, and then we'd not need to explicitly add local node below.

Member Author

right, it should be done in a separate change though, as this is just a translation of the current logic to use node ptrs.
#12799 and this series turn sync_nodes into an unordered_set, so the insertion order doesn't matter.
Sending is done in parallel anyhow, so I see no point in special-casing the local node.

}

gms::inet_address endpoint() const {
return std::get<gms::inet_address>(data);
Contributor

Was the host_id/endpoint exclusive also before?

Member Author

it didn't have to be, but in case both were set, we should have preferred the host_id.

Member Author

now, for having both, we have host_id_and_endpoint defined below, which is a simple struct containing both members.

// v2:
// Optional field, maps node index to (host_id, inet_address)
// gms::inet_address values above will contain the node index as a raw ipv4 address
std::unordered_map<locator::node::idx_type, locator::host_id_and_endpoint> nodes_dict;
Contributor

How can we send node indexes across nodes? The remote might have a completely different node index <-> node mapping.
This seems quite hacky to me. I think we should just use a completely new verb and structure in v2.

Contributor

Later I understood that the node indexes are indexes into this map, which is also transferred on the wire. So that is fine. Still, I think this is quite hacky.

Member Author

yeah, it's a way to extend the RPC in a backward-compatible way without requiring a new set of verbs and handlers.

.rack = (dc && rack) ? rack->value : locator::endpoint_dc_rack::default_location.rack,
};
return topo.add_node(haep.host_id, haep.endpoint, dc_rack, state);
}
Contributor

I'm confused: isn't topology the definitive source of node information? Why are we getting node data from gossip and adding it to topology?

Member Author

We learn about new nodes from node operations, which is where this function is called from.
Gossip is only used here to get additional node attributes like dc and rack.
