Use topology nodes for repair and node operations #12506
base: master
Conversation
converted to draft until #11987 is merged
Depends also on #12316
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and get rid of the ad-hoc implementation in network_topology_strategy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the thread_local default_location out of topology::get_location so it can be used elsewhere. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
storage_service::replicate_to_all_cores has a sophisticated way to mutate the token_metadata and effective_replication_map on shard 0 and clone those to all other shards, applying the changes only if mutate and clone succeeded on all shards, so we don't end up with only some of the shards holding the mutated copy if an error happened mid-way (and then we would need to roll back the change for exception safety). shared_token_metadata::mutate_token_metadata is currently only called from a unit test that needs to mutate the token metadata only on shard 0, but a following patch will require doing that on all shards. This change adds this capability by enforcing the call to be on shard 0, mutating the token_metadata into a temporary pending copy and cloning it on all other shards. Only then, when all shards succeeded, is the modified token_metadata set on all shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
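The mutate-then-clone scheme this commit describes can be sketched roughly as follows. This is plain C++ with no Seastar; `Metadata`, `Shards`, and `mutate_all` are illustrative names standing in for the real token_metadata machinery, not the actual scylla API:

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <stdexcept>
#include <vector>

struct Metadata { int version = 0; };

// One shared pointer per "shard"; shard 0 holds the authoritative copy.
using Shards = std::vector<std::shared_ptr<const Metadata>>;

// Mutate a pending copy on shard 0, clone it for every other shard, and
// only publish once every per-shard copy was built successfully. On
// failure the old copies stay in place, so no rollback is needed.
void mutate_all(Shards& shards, const std::function<void(Metadata&)>& f) {
    // Phase 1: build pending copies without touching the live ones.
    std::vector<std::shared_ptr<const Metadata>> pending;
    auto mutated = std::make_shared<Metadata>(*shards[0]);
    f(*mutated);  // may throw; live copies remain untouched
    pending.push_back(mutated);
    for (std::size_t i = 1; i < shards.size(); ++i) {
        pending.push_back(std::make_shared<Metadata>(*mutated));  // clone
    }
    // Phase 2: all mutate/clone steps succeeded; publish on every shard.
    for (std::size_t i = 0; i < shards.size(); ++i) {
        shards[i] = pending[i];
    }
}
```

The key property is the same one the commit message relies on: a failure during phase 1 leaves every shard with its old copy, so the cluster never observes a partially applied mutation.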
And keep per-node information (host_id, endpoint, dc_rack, is_pending) in node objects, indexed by topology on several indices: host_id, endpoint, current/pending, per dc, per dc/rack. Refs scylladb#6403 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The node index is a shorthand identifier for the node, to be used for mapping nodes in token metadata and node operations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a simple node state model with joining, normal, leaving, and left states to help manage nodes during replace with the same ip address. Later on, this could also help prevent nodes that were decommissioned, removed, or replaced from rejoining the cluster. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
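The lifecycle this commit introduces can be sketched as an enum with a forward-only transition check. This is a minimal illustration of the idea, not the actual scylla state set or transition rules:

```cpp
// Sketch of the node lifecycle: joining -> normal -> leaving -> left.
// A node that reached `left` (decommissioned, removed, or replaced)
// is never allowed to rejoin, which is the rejoin-prevention the
// commit message mentions.
enum class node_state { joining, normal, leaving, left };

constexpr bool can_transition(node_state from, node_state to) {
    switch (from) {
    case node_state::joining:
        // A join can complete, or be aborted straight to left.
        return to == node_state::normal || to == node_state::left;
    case node_state::normal:
        return to == node_state::leaving;
    case node_state::leaving:
        return to == node_state::left;
    case node_state::left:
        return false;  // terminal state: no rejoining
    }
    return false;
}
```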
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use netw::msg_addr directly instead, so fmt::formatter would know to use std::operator<<(..., const netw::msg_addr&), as it gets confused by type aliasing. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can associate the inet_address with a unique host_id. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…quest To be used by storage_service to make node_ops commands based on node_ptr:s rather than endpoints. Prepare for v2, where endpoints will encode the node index as a raw ip address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the local node and track other nodes to sync with all the way to node_ops_cmd_heartbeat_updater and send_node_ops_cmd. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the local node and track other nodes to sync with all the way to node_ops_cmd_heartbeat_updater and send_node_ops_cmd. Also, base the list of nodes to sync with on the token_metadata get_endpoint_to_host_id_map_for_reading that lists all normal token owners, rather than on the nodes in gossip, as the token_metadata is the source of truth for node operations, as in all other ops. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use std::variant<host_id, inet_address> to keep either of them and resolve using topology to a node_ptr. Reflect that also from token_metadata::parse_host_id_and_endpoint that gets a string describing either a host_id or inet_address and returns a node_ptr if found. Exceptions are thrown if the string is invalid or if the node is not found in topology. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to track nodes to sync with all the way to node_ops_cmd_heartbeat_updater and send_node_ops_cmd. Also, base the list of nodes to sync with on the token_metadata get_endpoint_to_host_id_map_for_reading that lists all normal token owners, rather than on the nodes in gossip, as the token_metadata is the source of truth for node operations, as in all other ops. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use topology to get to the leaving node and track other nodes to sync with all the way to node_ops_cmd_heartbeat_updater and send_node_ops_cmd. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With node_ops_cmd_v2, use the node index as the inet_address in the existing node_ops_cmd_request fields and send an additional dictionary mapping each node index to a pair of <host_id, inet_address>. Provide a set of getters that either extract the legacy field from the request if the dictionary isn't provided, or use the dictionary to translate the raw addresses, which encode node indexes, into the real node addresses. A later patch will use the host_id in the dictionary to look up nodes in storage_service::node_ops_cmd_handler, and both the host_id and inet_address to populate new nodes in topology for the replace and bootstrap cases. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
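The encoding trick here is that a field typed as an IPv4 address can carry a small integer node index in its four raw bytes, with the side dictionary translating the index back to the real (host_id, inet_address) pair. A rough sketch of the index-to-raw-address round trip, with illustrative function names and none of the actual scylla serialization:

```cpp
#include <cstdint>
#include <string>

// Encode a node index as a dotted-quad "raw IPv4 address", the way the
// v2 fields reuse the legacy inet_address slots.
std::string index_to_raw_ipv4(uint32_t idx) {
    return std::to_string(idx >> 24) + '.'
         + std::to_string((idx >> 16) & 0xff) + '.'
         + std::to_string((idx >> 8) & 0xff) + '.'
         + std::to_string(idx & 0xff);
}

// Recover the node index from the dotted-quad string; the receiver then
// looks the index up in the nodes_dict sent alongside the request.
uint32_t raw_ipv4_to_index(const std::string& s) {
    uint32_t v = 0, part = 0;
    for (char c : s) {
        if (c == '.') {
            v = (v << 8) | part;
            part = 0;
        } else {
            part = part * 10 + static_cast<uint32_t>(c - '0');
        }
    }
    return (v << 8) | part;
}
```

Because the indexes are only meaningful relative to the dictionary shipped in the same request, both sides agree on the mapping without any shared global numbering.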
Add helpers that print a warning or error to the log and then throw an exception with that message, as this is a common pattern we use. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
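The log-then-throw pattern the commit factors out can be sketched like this; the real helpers use scylla's logger and fmt-style formatting, while here `std::cerr` and the `log_and_throw` name are stand-ins:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Log the message, then throw it as the given exception type.
// A warn-level sibling would differ only in the log severity.
template <typename Exn = std::runtime_error>
[[noreturn]] void log_and_throw(const std::string& msg) {
    std::cerr << "ERROR: " << msg << '\n';
    throw Exn(msg);
}
```

Centralizing the pattern keeps the logged text and the exception message from drifting apart at each call site.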
As a first step, find each node to ignore given its endpoint or host_id, if available. At this stage, translate those back into a list of endpoints when passing to node_ops_meta_data. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…s_info Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to pass the tokens by value, since add_bootstrap_token(s) and remove_bootstrap_tokens are synchronous. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@@ -34,6 +34,7 @@ future<> one_test(const std::string& property_fname, bool exp_result) {
     utils::fb_utilities::set_broadcast_address(gms::inet_address("localhost"));
     utils::fb_utilities::set_broadcast_rpc_address(gms::inet_address("localhost"));
+    utils::fb_utilities::set_host_id(locator::host_id::create_random_id());
We should really find a better place for the local address, than a function-local global. For one, it is really hard to obtain it in gdb.
Maybe not in this PR.
I agree. That could be done as a follow up to #11987
Maybe topology itself would actually be the best place to keep the local address and host_id, on the local node.
@@ -19,7 +19,6 @@ using namespace netw;
 namespace api {

 using shard_info = messaging_service::shard_info;
-using msg_addr = messaging_service::msg_addr;
This (and the changes to accommodate it) looks like it should be part of the previous commit.
okay, will look into it for next version
namespace netw {

struct msg_addr {
    gms::inet_address addr;
    uint32_t cpu_id;
    locator::host_id host_id;
I'm not sure about this. Is this a temporary measure to bridge us to the time when class node is used everywhere? This effectively limits msg_addr to inter-node RPC usage. Maybe that is fine, it is in the messaging/ "module" after all.
Actually, I like how natural this makes it to pass nodes to messaging layer.
auto req = node_ops_cmd_request(cmd, uuid, std::move(tables));

auto get_endpoint = [] (const locator::node_ptr& node) {
    // Prep for v2
v2 of what?
I now know that it refers to v2 of the cmd request verb structure, but at this point in the PR it was not at all obvious.
okay, will clarify
auto start_time = std::chrono::steady_clock::now();
for (;;) {
    sync_nodes.clear();
    // Step 1: Decide who needs to sync data for bootstrap operation
-    for (const auto& x :_gossiper.get_endpoint_states()) {
+    for (const auto& [ep, host_id] : tmptr->get_endpoint_to_host_id_map_for_reading()) {
If this change is a correctness issue, it might be worth extracting it into a separate commit, in case we can trace some bug to it.
you're absolutely correct.
It's #13066
auto node = topo.find_node(host_id, locator::topology::must_exist::yes);
slogger.info("bootstrap[{}]: Check node={}, status={}", uuid, node, _gossiper.get_gossip_status(node->endpoint()));
if (node != local_node && !ignore_nodes.contains(node)) {
    sync_nodes.push_back(node);
`node != local_node` could just be removed here, and then we'd not need to explicitly add the local node below.
Right, though it should be done in a separate change, as this is just a translation of the current logic to use node ptrs. Also, #12799 and this series turn sync_nodes into an unordered_set, so the insertion order doesn't matter. Sending is done in parallel anyhow, so I see no point in special-casing the local node.
}

gms::inet_address endpoint() const {
    return std::get<gms::inet_address>(data);
Was the host_id/endpoint exclusive also before?
it didn't have to be, but in case both were set, we should have preferred the host_id.
Now, for having both, we have host_id_and_endpoint defined below, which is a simple struct containing both members.
// v2:
// Optional field, maps node index to (host_id, inet_address)
// gms::inet_address values above will contain the node index as a raw ipv4 address
std::unordered_map<locator::node::idx_type, locator::host_id_and_endpoint> nodes_dict;
How can we send node indexes across nodes? The remote might have a completely different node index <-> node mapping. This seems quite hacky to me. I think we should just use a completely new verb and structure in v2.
Later I understood that the node indexes are into this map here that is also transferred on the wire. So that is fine. Still think this is quite hacky.
Yeah, it's a way to extend the rpc in a backward-compatible way without requiring a new set of verbs and handlers.
    .rack = (dc && rack) ? rack->value : locator::endpoint_dc_rack::default_location.rack,
};
return topo.add_node(haep.host_id, haep.endpoint, dc_rack, state);
}
I'm confused: isn't topology the definitive source of node information? Why are we getting node data from gossip and adding it to topology?
We learn about new nodes from node operations, which is where this function is called from. Gossip is only used here to get additional node attributes like dc and rack.
This series depends on #11987 for introducing class node to topology.
It then uses shared node pointers to refer to nodes in node operations instead of using their endpoints,
in particular, to depend on the always-unique node host_id rather than on the endpoint, which may change and may not be unique during replace node.