
Nodes location and status get lost after full cluster restart with DOWN nodes #15787

Closed
bhalevy opened this issue Oct 22, 2023 · 19 comments
Labels: Backport candidate, Field-Tier1 (Urgent issues as per FieldEngineering request)

@bhalevy
Member

bhalevy commented Oct 22, 2023

If the cluster is fully restarted (say, after a power loss) and some nodes do not come up
(for example, after losing their ephemeral storage), we end up in a state where their endpoint_state,
as initialized by gossiper::add_saved_endpoint, contains only a zeroed generation and version, plus the endpoint's HOST_ID.

In particular, the endpoint_state is missing:

  1. DC and RACK
  2. STATUS

Having no DC/RACK state can potentially confuse NetworkTopologyStrategy, since the node's DC and RACK will not be populated in locator::topology.

Having no STATUS leads to the nodes being removed from gossip after the quarantine period.
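
To make the failure mode concrete, here is a minimal standalone sketch (a toy model with illustrative names, not the actual gossiper or locator types) of the partial state described above and of the two consequences listed:

```cpp
#include <iostream>
#include <map>
#include <string>

struct endpoint_state_model {
    int generation = 0;   // zeroed, as left by add_saved_endpoint after a restart
    int version = 0;
    std::map<std::string, std::string> app_states;  // HOST_ID, DC, RACK, STATUS, ...
};

int main() {
    // A DOWN peer as reconstructed after a full cluster restart: HOST_ID only.
    endpoint_state_model down_node;
    down_node.app_states["HOST_ID"] = "8d5ed9f4-7764-4dbd-bad8-43fddce94b7c";

    // 1. No DC/RACK: the node cannot be placed in locator::topology, which is
    //    what NetworkTopologyStrategy needs for replica placement.
    bool has_location = down_node.app_states.count("DC") && down_node.app_states.count("RACK");

    // 2. No STATUS: the endpoint looks like a gossip-only member and is
    //    eventually removed after the quarantine period.
    bool has_status = down_node.app_states.count("STATUS");

    std::cout << std::boolalpha << "has DC/RACK: " << has_location
              << ", has STATUS: " << has_status << "\n";   // both false
}
```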

@bhalevy
Copy link
Member Author

bhalevy commented Oct 22, 2023

Adding DC/RACK from system.peers and initializing STATUS=shutdown empirically proved to be safe, as this is the best a node knows about the down nodes.
The endpoint state can then be updated by other nodes over gossip, if they have better knowledge of the endpoint state.

Note that in split-brain situations, other nodes may have stale/wrong information, but we don't have a good way to prevent this state from "infecting" other nodes that restarted and initialized the endpoint state as described above.
In this case we need to shut the whole cluster down again and restart the nodes, so that all of them initialize the endpoint_state of the DOWN nodes as described above; when those nodes either rejoin the cluster or are replaced, the endpoint_state will be updated accordingly.
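
In the same toy model, the proposed initialization would look roughly like the sketch below (illustrative names and a stand-in peer_row for a system.peers entry; not the actual add_saved_endpoint code):

```cpp
#include <map>
#include <string>
#include <vector>

struct peer_row {                    // stand-in for a system.peers entry
    std::string host_id, dc, rack;
    std::vector<long> tokens;
};

struct endpoint_state_model {
    std::map<std::string, std::string> app_states;
};

// Seed the saved endpoint_state with everything system.peers remembers.
endpoint_state_model add_saved_endpoint_sketch(const peer_row& p) {
    endpoint_state_model st;
    st.app_states["HOST_ID"] = p.host_id;
    if (!p.dc.empty())   st.app_states["DC"] = p.dc;     // proposed: carry DC
    if (!p.rack.empty()) st.app_states["RACK"] = p.rack; // proposed: carry RACK
    if (!p.tokens.empty()) {
        // Best local knowledge about a token owner that never came back up;
        // gossip can later overwrite this with better information.
        st.app_states["STATUS"] = "shutdown";
    }
    return st;
}

int main() {
    peer_row p{"8d5ed9f4-7764-4dbd-bad8-43fddce94b7c", "dc1", "rack1", {42}};
    auto st = add_saved_endpoint_sketch(p);
    return st.app_states.count("STATUS") ? 0 : 1;
}
```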

bhalevy added a commit to bhalevy/scylla that referenced this issue Oct 22, 2023
When loading endpoint_state from system.peers, pass the loaded nodes' dc/rack
info from storage_service::join_token_ring to gossiper::add_saved_endpoint.

Load the endpoint DC/RACK information into the endpoint_state,
if available, and initialize the endpoint_state STATUS to `shutdown`
if it owns tokens, as this is the best we know about the endpoint at this stage.

Fixes scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@asias
Contributor

asias commented Oct 23, 2023

It is not safe to inject STATUS shutdown for other nodes. Injecting it is worse than what it tries to solve. https://github.com/scylladb/scylladb/pull/15788/files/e9e787868e5460f42731228dd6daaeb61add6ace#r1368045304

What we can do is prevent the replace op from proceeding if a node with unknown status is seen; we do the same check for bootstrapping a node.

To recover a cluster from a full cluster down + losing nodes, which should happen rarely, we can introduce an API to inject the status for the lost nodes explicitly.

This way, we can both recover a messed-up cluster and be safe at the same time.

For rack and dc info, we can load it during boot unconditionally and use it on the node that loads it, so the network topology strategy would work. If each node loads it, it is not necessary to inject it for other nodes.

@asias
Contributor

asias commented Oct 23, 2023

I have another idea:

When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.

--ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2

When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.

No extra step of running an API to inject dc/rack/status is needed.
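
For illustration only, a hypothetical parser for the extended option value proposed above ("<host_id>:<dc>:<rack>,..."); this is not the actual Scylla option handling:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct ignored_node {
    std::string host_id, dc, rack;   // dc/rack stay empty in the plain host_id form
};

std::vector<ignored_node> parse_ignore_dead_nodes(const std::string& value) {
    std::vector<ignored_node> result;
    std::stringstream entries(value);
    std::string entry;
    while (std::getline(entries, entry, ',')) {   // entries are comma-separated
        std::stringstream fields(entry);
        ignored_node n;
        std::getline(fields, n.host_id, ':');
        std::getline(fields, n.dc, ':');          // optional extension
        std::getline(fields, n.rack, ':');        // optional extension
        result.push_back(n);
    }
    return result;
}

int main() {
    for (const auto& n : parse_ignore_dead_nodes(
            "8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1")) {
        std::cout << n.host_id << " -> " << n.dc << "/" << n.rack << "\n";
    }
}
```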

@bhalevy
Member Author

bhalevy commented Oct 23, 2023

But we must have the complete dc/rack information for token mapping in the steady state as well with the network-topology replication strategy; otherwise we'll access the wrong nodes, both for reads and, worse, for writes.

@asias
Contributor

asias commented Oct 23, 2023

> But we must have the complete dc/rack information for token mapping in the steady state as well with the network-topology replication strategy; otherwise we'll access the wrong nodes, both for reads and, worse, for writes.

Current code already loads dc and rack info during boot up.

...
tmptr->update_topology(ep, get_dc_rack(ep), locator::node::state::normal);
...
co_await _gossiper.add_saved_endpoint(ep);
...

Each node using the loaded dc and rack does not mean each node has to propagate such information on behalf of another node.

@bhalevy
Member Author

bhalevy commented Oct 23, 2023

> I have another idea:

> When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.

> --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2

> When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.

> No extra step of running an API to inject dc/rack/status is needed.

Isn't it tool late?
Maybe nodes should refuse to start if they observe nodes that are supposed to be normal token owners but no other node in the cluster was able to communicate with them.

In this case --ignore-dead-nodes could be applied on restart and then the dc and rack can either be loaded from system peers or be given by the config option (I like that part less).

@asias
Contributor

asias commented Oct 23, 2023

> I have another idea:
> When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.
> --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2
> When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.
> No extra step of running an API to inject dc/rack/status is needed.

> Isn't it tool late?

Why? It only extends the existing parameter.

> Maybe nodes should refuse to start if they observe nodes that are supposed to be normal token owners but no other node in the cluster was able to communicate with them.

What do you mean exactly? We need the first node to be able to boot after a full cluster down.

> In this case --ignore-dead-nodes could be applied on restart and then the dc and rack can either be loaded from system peers or be given by the config option (I like that part less).

Note, only the node that is running the replace op has the --ignore-dead-nodes-for-replace option. It cannot load the rack and dc from the system table because it is an empty node when it runs replace. However, the replacing node can choose to use the rack and dc info from the cluster (that is the common case; the cluster still has the full info for the dead node).

So, we can go one step further.

One can still use normally:

--ignore-dead-nodes-for-replace host_id_of_node_to_ignore

In case there was a full cluster down, we cannot get the DC / Rack / STATUS. We reject the replace op, asking the user to provide

--ignore-dead-nodes-for-replace host_id_of_node_to_ignore:rack:dc

so replace can continue.
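
A toy sketch of the check being proposed here (hypothetical names, not the actual storage_service validation): reject the replace when an ignored node's location and status are unknown through gossip and the operator did not supply them:

```cpp
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

struct ignored_node_state {
    std::optional<std::string> dc, rack, status;  // all unknown after a full cluster down
};

void validate_replace(const ignored_node_state& gossip_view,
                      const std::optional<std::string>& rack_from_option,
                      const std::optional<std::string>& dc_from_option) {
    bool known_via_gossip = gossip_view.dc && gossip_view.rack && gossip_view.status;
    bool supplied_by_admin = rack_from_option && dc_from_option;
    if (!known_via_gossip && !supplied_by_admin) {
        throw std::runtime_error(
            "ignored node has unknown DC/RACK/STATUS; "
            "use --ignore-dead-nodes-for-replace <host_id>:<rack>:<dc>");
    }
}

int main() {
    try {
        // Neither gossip nor the operator knows the dead node's location/status.
        validate_replace(ignored_node_state{}, std::nullopt, std::nullopt);
    } catch (const std::exception& e) {
        std::cout << "replace rejected: " << e.what() << "\n";
    }
}
```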

@bhalevy
Member Author

bhalevy commented Oct 23, 2023

> I have another idea:
> When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.
> --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2
> When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.
> No extra step of running an API to inject dc/rack/status is needed.

> Isn't it tool late?

> Why? It only extends the existing parameter.

I mean "too late", sorry.
In the sense that the problem is with initializing the endpoint state of the other nodes in the cluster, which happens before replace; hence it is too late to provide it at replace time.

> Maybe nodes should refuse to start if they observe nodes that are supposed to be normal token owners but no other node in the cluster was able to communicate with them.

> What do you mean exactly? We need the first node to be able to boot after a full cluster down.

Hmm, what I have in mind is all of them starting concurrently, but you're right that we need to support restarting them one at a time.

> In this case --ignore-dead-nodes could be applied on restart and then the dc and rack can either be loaded from system peers or be given by the config option (I like that part less).

> Note, only the node that is running the replace op has the --ignore-dead-nodes-for-replace option.

That's true today, but we can change that.
(We even have a FIXME for it:

// TODO: specify ignore_nodes for bootstrap
)

> It cannot load the rack and dc from the system table because it is an empty node when it runs replace. However, the replacing node can choose to use the rack and dc info from the cluster (that is the common case; the cluster still has the full info for the dead node).

But after full restart we don't load DC/RACK endpoint state without this patch, so other nodes don't propagate it to the replacing node.

> So, we can go one step further.

> One can still use normally:

> --ignore-dead-nodes-for-replace host_id_of_node_to_ignore

> In case there was a full cluster down, we cannot get the DC / Rack / STATUS. We reject the replace op, asking the user to provide

> --ignore-dead-nodes-for-replace host_id_of_node_to_ignore:rack:dc

> so replace can continue.

If those nodes are dead, all other nodes need to reflect that in an official way and they all need to agree about their state.
If we manually tell the other nodes that we restart after a full cluster shutdown that some nodes are dead, with or without specifying their dc/rack location (which can be retrieved from system.peers; no need to extend the flag for that IMO), they can initialize the endpoint_state similarly to what is proposed in the patch. Then the replacing node can be passed only --ignore-dead-nodes-for-replace, as we do today, and it can get their endpoint_state exactly like it does today.

@asias
Contributor

asias commented Oct 24, 2023

> I have another idea:
> When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.
> --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2
> When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.
> No extra step of running an API to inject dc/rack/status is needed.

> Isn't it tool late?

> Why? It only extends the existing parameter.

> I mean "too late", sorry. In the sense that the problem is with initializing the endpoint state of the other nodes in the cluster, which happens before replace; hence it is too late to provide it at replace time.

It is not too late, because existing nodes load both tokens and rack/dc info when they restart. The replacing node is the only one that does not know the rack and dc info.

> Maybe nodes should refuse to start if they observe nodes that are supposed to be normal token owners but no other node in the cluster was able to communicate with them.

> What do you mean exactly? We need the first node to be able to boot after a full cluster down.

> Hmm, what I have in mind is all of them starting concurrently, but you're right that we need to support restarting them one at a time.

> In this case --ignore-dead-nodes could be applied on restart and then the dc and rack can either be loaded from system peers or be given by the config option (I like that part less).

> Note, only the node that is running the replace op has the --ignore-dead-nodes-for-replace option.

> That's true today, but we can change that. (We even have a FIXME for it:

> // TODO: specify ignore_nodes for bootstrap

The FIXME is about adding support for --ignore-dead-nodes-for-bootstrap. We can do a similar thing for a bootstrapping node too.

By "only the node", I meant we do not pass the option to the existing nodes in the cluster; we only add this option to the replacing node. If we add --ignore-dead-nodes-for-bootstrap, we add the option to the bootstrapping node too. But in both cases, they are new nodes.

> It cannot load the rack and dc from the system table because it is an empty node when it runs replace. However, the replacing node can choose to use the rack and dc info from the cluster (that is the common case; the cluster still has the full info for the dead node).

> But after full restart we don't load DC/RACK endpoint state without this patch, so other nodes don't propagate it to the replacing node.

Yes. That's why I proposed two solutions above:

  1. Use an explicit REST API to inject the dc/rack/status info for the nodes that are down.

  2. Specify the dc and rack info using --ignore-dead-nodes-for-bootstrap node_host_id:node_rack:node_dc.

It is best to avoid having each node inject the status for other nodes on restart. It is not the source of truth.

> So, we can go one step further.
> One can still use normally:
> --ignore-dead-nodes-for-replace host_id_of_node_to_ignore
> In case there was a full cluster down, we cannot get the DC / Rack / STATUS. We reject the replace op, asking the user to provide
> --ignore-dead-nodes-for-replace host_id_of_node_to_ignore:rack:dc
> so replace can continue.

> If those nodes are dead, all other nodes need to reflect that in an official way and they all need to agree about their state. If we manually tell the other nodes that we restart after a full cluster shutdown that some nodes are dead, with or without specifying their dc/rack location (which can be retrieved from system.peers; no need to extend the flag for that IMO), they can initialize the endpoint_state similarly to what is proposed in the patch. Then the replacing node can be passed only --ignore-dead-nodes-for-replace, as we do today, and it can get their endpoint_state exactly like it does today.

The above two solutions could be the "official way" to make the cluster agree about the down nodes' states.

Let's agree it is better to have admin intervention when such a rare situation happens (multiple nodes down + a full cluster down). Actually, the admin must intervene because the admin needs to run the replace procedure anyway. We can piggyback "telling the cluster about the down nodes" on the replace op in one step, using --ignore-dead-nodes-for-bootstrap node_host_id:node_rack:node_dc.

@bhalevy
Member Author

bhalevy commented Oct 24, 2023

Even if we have such an interface we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others.
In this case we still need to make sure that we have the right protections to be able to serve CQL using the remaining nodes, using the correct replication map, and we're able to refuse unsafe node operations that don't have the ignore nodes option, and serve them correctly with the ignore nodes option.

To me, it seems much safer to add an API to officially declare nodes as DEAD and mark them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace operation.

This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an ignore_nodes flag.

This method can be applied also in the common case of a single node going down in normal operation mode.
Removenode and replace node can use the above path to mark it as permanently DEAD on all living nodes even before they apply any change to the token metadata, if more nodes are lost during e.g. replace.
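
A toy illustration of the "officially declare nodes DEAD" idea (purely hypothetical; no such registry or API exists today): a persisted set of host_ids that every node would consult before letting an endpoint rejoin, and that removenode/replace would populate automatically:

```cpp
#include <iostream>
#include <set>
#include <string>

struct dead_nodes_registry {
    std::set<std::string> dead;   // would be persisted in a system table in practice

    void declare_dead(const std::string& host_id) { dead.insert(host_id); }

    // A node marked DEAD may only leave via removenode/replace, never rejoin.
    bool may_rejoin(const std::string& host_id) const { return !dead.count(host_id); }
};

int main() {
    dead_nodes_registry reg;
    reg.declare_dead("8d5ed9f4-7764-4dbd-bad8-43fddce94b7c");
    std::cout << std::boolalpha
              << reg.may_rejoin("8d5ed9f4-7764-4dbd-bad8-43fddce94b7c") << "\n";  // false
}
```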

@asias
Contributor

asias commented Oct 24, 2023

> Even if we have such an interface we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others. In this case we still need to make sure that we have the right protections to be able to serve CQL using the remaining nodes, using the correct replication map, and we're able to refuse unsafe node operations that don't have the ignore nodes option, and serve them correctly with the ignore nodes option.

Currently, if there are nodes down and there was no topology change, each node will use the correct replication map no matter what (e.g., with/without restart / full cluster down).

> To me, it seems much safer to add an API to officially declare nodes as DEAD and mark them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace operation.

> This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an ignore_nodes flag.

> This method can be applied also in the common case of a single node going down in normal operation mode. Removenode and replace node can use the above path to mark it as permanently DEAD on all living nodes even before they apply any change to the token metadata, if more nodes are lost during e.g. replace.

There is no such state called DEAD currently. A node can be UP or DOWN (by gossip). A node can be in the ring or removed from the ring. I think what you actually meant is to isolate the node, so that the node could not join back to the cluster.

The node should not come back only after the new replace has started; it can come back with no problem if there is no replace op. This makes it a perfect fit for the replacing node to trigger the isolation of the node, rather than an extra API call, because the admin can very easily forget to do so.

@bhalevy
Member Author

bhalevy commented Oct 24, 2023

> Even if we have such an interface we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others. In this case we still need to make sure that we have the right protections to be able to serve CQL using the remaining nodes, using the correct replication map, and we're able to refuse unsafe node operations that don't have the ignore nodes option, and serve them correctly with the ignore nodes option.

> Currently, if there are nodes down and there was no topology change, each node will use the correct replication map no matter what (e.g., with/without restart / full cluster down).

But the endpoint_state in gossip still seems broken.

> To me, it seems much safer to add an API to officially declare nodes as DEAD and mark them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace operation.

> This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an ignore_nodes flag.
> This method can be applied also in the common case of a single node going down in normal operation mode. Removenode and replace node can use the above path to mark it as permanently DEAD on all living nodes even before they apply any change to the token metadata, if more nodes are lost during e.g. replace.

> There is no such state called DEAD currently. A node can be UP or DOWN (by gossip). A node can be in the ring or removed from the ring. I think what you actually meant is to isolate the node, so that the node could not join back to the cluster.

Yes (in the past I proposed adding a system.quarantine table, IIRC. That should be taken care of by raft-based consistent topology changes AFAIK, but I'm looking for a change that could be backported to older versions).

> The node should not come back only after the new replace has started; it can come back with no problem if there is no replace op. This makes it a perfect fit for the replacing node to trigger the isolation of the node, rather than an extra API call, because the admin can very easily forget to do so.

I agree it should be done automatically by replace (and remove node).

So what you're suggesting is to add dc and rack to --ignore-dead-nodes-for-replace to populate the local endpoint_state on the replacing node as well as to build a correct replication map (including token_metadata and locator::topology)
instead of relying on gossip to get those application states?

Would we be able to map the host_id in the option to an ip address without the endpoint_state in gossip?
I guess we'll need to add the ip address too if the replacing node knows nothing about the dead nodes as the replacing node has no persistent local state and other nodes may have no sharable endpoint_state of the dead nodes after full-cluster restart.

@asias
Contributor

asias commented Oct 24, 2023

> Even if we have such an interface we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others. In this case we still need to make sure that we have the right protections to be able to serve CQL using the remaining nodes, using the correct replication map, and we're able to refuse unsafe node operations that don't have the ignore nodes option, and serve them correctly with the ignore nodes option.

> Currently, if there are nodes down and there was no topology change, each node will use the correct replication map no matter what (e.g., with/without restart / full cluster down).

> But the endpoint_state in gossip still seems broken.

> To me, it seems much safer to add an API to officially declare nodes as DEAD and mark them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace operation.

> This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an ignore_nodes flag.
> This method can be applied also in the common case of a single node going down in normal operation mode. Removenode and replace node can use the above path to mark it as permanently DEAD on all living nodes even before they apply any change to the token metadata, if more nodes are lost during e.g. replace.

> There is no such state called DEAD currently. A node can be UP or DOWN (by gossip). A node can be in the ring or removed from the ring. I think what you actually meant is to isolate the node, so that the node could not join back to the cluster.

> Yes (in the past I proposed adding a system.quarantine table, IIRC. That should be taken care of by raft-based consistent topology changes AFAIK, but I'm looking for a change that could be backported to older versions).

> The node should not come back only after the new replace has started; it can come back with no problem if there is no replace op. This makes it a perfect fit for the replacing node to trigger the isolation of the node, rather than an extra API call, because the admin can very easily forget to do so.

> I agree it should be done automatically by replace (and remove node).

> So what you're suggesting is to add dc and rack to --ignore-dead-nodes-for-replace to populate the local endpoint_state on the replacing node as well as to build a correct replication map (including token_metadata and locator::topology) instead of relying on gossip to get those application states?

Yes. Do not rely on gossip from the peer nodes to inject the rack and dc info. But the replacing node can inject rack/dc/status=normal into gossip, so we can reuse the existing code to set up the correct token_metadata and locator::topology for the dead nodes.

One thing I do not like in this solution is that the admin can make mistakes when providing the rack and dc. I guess the chance is low, though.

> Would we be able to map the host_id in the option to an ip address without the endpoint_state in gossip? I guess we'll need to add the ip address too if the replacing node knows nothing about the dead nodes as the replacing node has no persistent local state and other nodes may have no sharable endpoint_state of the dead nodes after full-cluster restart.

With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

@bhalevy
Member Author

bhalevy commented Oct 24, 2023

> With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state.

Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?

@asias
Contributor

asias commented Oct 24, 2023

> With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

> First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state.

The endpoint_state will be removed only if the node is a gossip-only member, which means it is not part of the ring. Since we inject status=normal on the replacing node for dead nodes, they will be part of the ring and will not be evicted.

> Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?

We could do that. But this requires the other nodes to inject the dc and rack. I wish we could inject as few states as possible. The host_id is injected so that removenode can work after a full cluster down. Note, we only need to provide the rack/dc after a full cluster down, as I mentioned here: #15787 (comment).

@bhalevy
Member Author

bhalevy commented Oct 24, 2023

> With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

> First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state.

> The endpoint_state will be removed only if the node is a gossip-only member, which means it is not part of the ring. Since we inject status=normal on the replacing node for dead nodes, they will be part of the ring and will not be evicted.

> Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?

> We could do that. But this requires the other nodes to inject the dc and rack. I wish we could inject as few states as possible. The host_id is injected so that removenode can work after a full cluster down. Note, we only need to provide the rack/dc after a full cluster down, as I mentioned here: #15787 (comment).

But if those dead nodes are present in system.peers on the living nodes including HOST_ID, DC, RACK, and TOKENS, why can't we rely on that? We're not injecting those states out of thin air.

@asias
Contributor

asias commented Oct 24, 2023

> With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

> First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state.

> The endpoint_state will be removed only if the node is a gossip-only member, which means it is not part of the ring. Since we inject status=normal on the replacing node for dead nodes, they will be part of the ring and will not be evicted.

> Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?

> We could do that. But this requires the other nodes to inject the dc and rack. I wish we could inject as few states as possible. The host_id is injected so that removenode can work after a full cluster down. Note, we only need to provide the rack/dc after a full cluster down, as I mentioned here: #15787 (comment).

> But if those dead nodes are present in system.peers on the living nodes including HOST_ID, DC, RACK, and TOKENS, why can't we rely on that? We're not injecting those states out of thin air.

We already inject host_id and tokens. We are safe to inject dc and rack, and I do not have a strong opinion against injecting them. I thought about it as a way to validate the dc and rack that are provided by the admin. The gain is limited.

So, anyway, in summary:

I think we can inject dc and rack during boot-up; it is safe to do so. We let the replacing node inject status=normal for the dead nodes only in case we do not have the status for them after a full cluster down. We should do the injection very carefully.
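
Putting this summary together in the same kind of toy model (simplified, illustrative names; the real call sites differ): every node seeds DC/RACK for its peers from its own system.peers on boot, and only the replacing node fills in a missing STATUS, and only for nodes it was explicitly told to ignore:

```cpp
#include <map>
#include <optional>
#include <string>

struct node_info {
    std::string dc, rack;
    std::optional<std::string> status;  // unknown after a full cluster restart
};

using cluster_view = std::map<std::string, node_info>;  // keyed by host_id

void seed_from_local_peers(cluster_view& view, const std::string& host_id,
                           const std::string& dc, const std::string& rack) {
    view[host_id].dc = dc;    // safe: comes from this node's own system.peers
    view[host_id].rack = rack;
}

void replacing_node_fixup(cluster_view& view, const std::string& ignored_host_id) {
    auto& n = view[ignored_host_id];
    if (!n.status) {
        n.status = "normal";  // only for explicitly ignored dead nodes with no status
    }
}

int main() {
    cluster_view view;
    seed_from_local_peers(view, "8d5ed9f4-7764-4dbd-bad8-43fddce94b7c", "dc1", "rack1");
    replacing_node_fixup(view, "8d5ed9f4-7764-4dbd-bad8-43fddce94b7c");
}
```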

@mykaul added the Field-Tier1 (Urgent issues as per FieldEngineering request) label Nov 5, 2023
bhalevy added a commit to bhalevy/scylla that referenced this issue Nov 6, 2023
When a node bootstraps or replaces a node after full cluster
shutdown and restart, some nodes may be down.

Existing nodes in the cluster load the down nodes' TOKENS
(and recently, in this series, also DC and RACK) from system.peers
and on that path, also populate locator::topology and token_metadata
accordingly with the down nodes' tokens.

However, a bootstrapping node has no persistent knowledge of the down nodes,
and it learns about their existence from gossip.
But, since the down nodes have unknown status, they never go
through `handle_state_normal`.

This change updates the down nodes state in topology
and token_metadata as if they were loaded from system tables
in `join_cluster`.

Fixes scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Dec 11, 2023
Load saved endpoints with an intermediate `LOADED` status
that is not `UNKNOWN` but not `NORMAL` yet either.

It will be used in a later patch to populate
bootstrapping nodes with the state of DOWN nodes
as the bootstrapping nodes have no system.peers
table locally and they have no other way of knowing
about DOWN nodes after a full cluster restart.

Refs scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Dec 11, 2023
When loading endpoint_state from system.peers, pass the loaded nodes' dc/rack
info from storage_service::join_token_ring to gossiper::add_saved_endpoint.

Load the endpoint DC/RACK information into the endpoint_state,
if available, so it can propagate to bootstrapping nodes
via gossip, even if those nodes are DOWN after a full cluster-restart.

Refs scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Dec 11, 2023
When a node bootstraps or replaces a node after full cluster
shutdown and restart, some nodes may be down.

Existing nodes in the cluster load the down nodes' TOKENS
(and recently, in this series, also DC and RACK) from system.peers
and then populate locator::topology and token_metadata
accordingly with the down nodes' tokens in storage_service::join_cluster.

However, a bootstrapping/replacing node has no persistent knowledge
of the down nodes, and it learns about their existence only from gossip.
But since the down nodes have unknown status, they never go
through `handle_state_normal` and therefore they are not accounted
as normal token owners.

This patch updates the ignored nodes (for replace) state in topology
and token_metadata as if they were loaded from system tables
in `prepare_replacement_info` based on the endpoint_state retrieved
in the shadow round initiated in prepare_replacement_info.

Fixes scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@tgrabiec
Contributor

This looks like another incarnation of #6088.

Why not restore STATUS, so that the ring is consistent for all the ops? I don't like fixing up the ring from replace; it works around a bug in state propagation, piling up even more complexity.

What about removenode, doesn't it also need accurate ring info?

bhalevy added a commit to bhalevy/scylla that referenced this issue Feb 5, 2024
When loading endpoint_state from system.peers,
pass the loaded nodes' dc/rack info from
storage_service::join_token_ring to gossiper::add_saved_endpoint.

Load the endpoint DC/RACK information into the endpoint_state,
if available, so it can propagate to bootstrapping nodes
via gossip, even if those nodes are DOWN after a full cluster-restart.

Note that this change makes the host_id presence
mandatory following scylladb#16376.
The reason to do so is that the other states (tokens, dc, and rack)
are useless without the host_id.
This change is backward compatible since the HOST_ID application state
has been written to system.peers since Scylla's inception,
and it would be missing only due to a potential exception
in older versions that failed to write it.
In this case, manual intervention is needed and
the correct HOST_ID needs to be manually updated in system.peers.

Refs scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@mykaul added this to the 6.1 milestone Apr 3, 2024
kbr-scylla added a commit that referenced this issue Apr 16, 2024
…load the ignored nodes state for replace' from Benny Halevy

The problem this series solves is correctly ignoring DOWN nodes state
when replacing a node.

When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.

Since the replacing node is bootstrapping it starts with an empty
system.peers table so it has no notion about any node state and it
learns about all other nodes via gossip shadow round done in
`storage_service::prepare_replacement_info`.

Normally, since the DOWN nodes to ignore already joined the ring, the
remaining nodes will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state from them,
which is loaded from system.peers.

Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.

The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.

But still, without a `STATUS` those nodes are not considered as normal
token owners yet, and they do not go through handle_state_normal which
adds them to the topology and token_metadata.

The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`.  If there are more DOWN nodes that are not
explicitly ignored replace will fail (as it should).

Fixes #15787

Closes #15788

* github.com:scylladb/scylladb:
  storage_service: join_token_ring: load ignored nodes state if replacing
  storage_service: replacement_info: return ignore_nodes state
  locator: host_id_or_endpoint: keep value as variant
  gms: endpoint_state: add getters for host_id, dc_rack, and tokens
  storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
  gossiper: add_saved_endpoint: set dc and rack
  gossiper: add_saved_endpoint: fixup indentation
  gossiper: add_saved_endpoint: make host_id mandatory
  gossiper: add load_endpoint_state
  gossiper: start_gossiping: log local state
mergify bot pushed a commit that referenced this issue Apr 16, 2024
When a node bootstraps or replaces a node after full cluster
shutdown and restart, some nodes may be down.

Existing nodes in the cluster load the down nodes' TOKENS
(and recently, in this series, also DC and RACK) from system.peers
and then populate locator::topology and token_metadata
accordingly with the down nodes' tokens in storage_service::join_cluster.

However, a bootstrapping/replacing node has no persistent knowledge
of the down nodes, and it learns about their existence only from gossip.
But since the down nodes have unknown status, they never go
through `handle_state_normal` (in gossiper mode) and therefore
they are not accounted as normal token owners.
This is handled by `topology_state_load`, but not with
gossip-based node operations.

This patch updates the ignored nodes (for replace) state in topology
and token_metadata as if they were loaded from system tables,
after calling `prepare_replacement_info` when raft topology changes are
disabled, based on the endpoint_state retrieved in the shadow round
initiated in prepare_replacement_info.

Fixes #15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 655d624)

# Conflicts:
#	service/storage_service.cc