
Nodes location and status get lost after full cluster restart with DOWN nodes #15787

Closed
bhalevy opened this issue Oct 22, 2023 · 19 comments
Labels: Backport candidate, Field-Tier1 (Urgent issues as per FieldEngineering request)

@bhalevy
Member

bhalevy commented Oct 22, 2023

If the cluster is fully restarted (say, after a power loss) and some nodes do not come up
(for example, after losing their ephemeral storage), we end up in a state where their endpoint_state,
as initialized by gossiper::add_saved_endpoint, contains only a zeroed generation and version, plus the endpoint's HOST_ID.

In particular, the endpoint_state is missing:

  1. DC and RACK
  2. STATUS

Having no DC/RACK state can potentially confuse NetworkTopologyStrategy, since the node's DC and RACK will not be populated in locator::topology.

Having no STATUS leads to the nodes being removed from gossip after the quarantine period.
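
To make the failure mode concrete, here is a minimal standalone sketch (a toy model with illustrative names, not the actual gossiper or locator types) of the partial state described above and of the two consequences listed:

```cpp
#include <iostream>
#include <map>
#include <string>

struct endpoint_state_model {
    int generation = 0;   // zeroed, as left by add_saved_endpoint after a restart
    int version = 0;
    std::map<std::string, std::string> app_states;  // HOST_ID, DC, RACK, STATUS, ...
};

int main() {
    // A DOWN peer as reconstructed after a full cluster restart: HOST_ID only.
    endpoint_state_model down_node;
    down_node.app_states["HOST_ID"] = "8d5ed9f4-7764-4dbd-bad8-43fddce94b7c";

    // 1. No DC/RACK: the node cannot be placed in locator::topology, which is
    //    what NetworkTopologyStrategy needs for replica placement.
    bool has_location = down_node.app_states.count("DC") && down_node.app_states.count("RACK");

    // 2. No STATUS: the endpoint looks like a gossip-only member and is
    //    eventually removed after the quarantine period.
    bool has_status = down_node.app_states.count("STATUS");

    std::cout << std::boolalpha << "has DC/RACK: " << has_location
              << ", has STATUS: " << has_status << "\n";   // both false
}
```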

@bhalevy
Copy link
Member Author

bhalevy commented Oct 22, 2023

Adding DC/RACK from system.peers and initializing STATUS=shutdown empirically proved to be safe, as this is the best a node knows about the down nodes.
The endpoint state can then be updated by other nodes over gossip, if they have better knowledge of the endpoint state.

Note that in split-brain situations, other nodes may have stale/wrong information, but we don't have a good way to prevent this state from "infecting" other nodes that restarted and initialized the endpoint state as described above.
In this case we need to shut the whole cluster down again and restart the nodes, so that all of them initialize the endpoint_state of the DOWN nodes as described above; when those nodes either rejoin the cluster or are replaced, the endpoint_state will be updated accordingly.
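
In the same toy model, the proposed initialization would look roughly like the sketch below (illustrative names and a stand-in peer_row for a system.peers entry; not the actual add_saved_endpoint code):

```cpp
#include <map>
#include <string>
#include <vector>

struct peer_row {                    // stand-in for a system.peers entry
    std::string host_id, dc, rack;
    std::vector<long> tokens;
};

struct endpoint_state_model {
    std::map<std::string, std::string> app_states;
};

// Seed the saved endpoint_state with everything system.peers remembers.
endpoint_state_model add_saved_endpoint_sketch(const peer_row& p) {
    endpoint_state_model st;
    st.app_states["HOST_ID"] = p.host_id;
    if (!p.dc.empty())   st.app_states["DC"] = p.dc;     // proposed: carry DC
    if (!p.rack.empty()) st.app_states["RACK"] = p.rack; // proposed: carry RACK
    if (!p.tokens.empty()) {
        // Best local knowledge about a token owner that never came back up;
        // gossip can later overwrite this with better information.
        st.app_states["STATUS"] = "shutdown";
    }
    return st;
}

int main() {
    peer_row p{"8d5ed9f4-7764-4dbd-bad8-43fddce94b7c", "dc1", "rack1", {42}};
    auto st = add_saved_endpoint_sketch(p);
    return st.app_states.count("STATUS") ? 0 : 1;
}
```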

bhalevy added a commit to bhalevy/scylla that referenced this issue Oct 22, 2023
When loading endpoint_state from system.peers, pass the loaded nodes' dc/rack
info from storage_service::join_token_ring to gossiper::add_saved_endpoint.

Load the endpoint DC/RACK information into the endpoint_state,
if available, and initialize the endpoint_state STATUS to `shutdown`
if it owns tokens, as this is the best we know about the endpoint at this stage.

Fixes scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@asias
Contributor

asias commented Oct 23, 2023

It is not safe to inject STATUS shutdown for other nodes. Injecting it is worse than what it tries to solve. https://github.com/scylladb/scylladb/pull/15788/files/e9e787868e5460f42731228dd6daaeb61add6ace#r1368045304

What we can do is prevent the replace op from proceeding if a node with unknown status is seen; we do the same check for bootstrapping a node.

To recover a cluster from a full cluster down + losing nodes, which should happen rarely, we can introduce an API to inject the status for the lost nodes explicitly.

This way, we can both recover a messed-up cluster and be safe at the same time.

For rack and dc info, we can load it during boot unconditionally and use it on the node that loads it, so the network topology strategy would work. If each node loads it, it is not necessary to inject it for other nodes.

@asias
Contributor

asias commented Oct 23, 2023

I have another idea:

When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.

--ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2

When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.

No extra step of running an API to inject dc/rack/status is needed.
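
For illustration only, a hypothetical parser for the extended option value proposed above ("<host_id>:<dc>:<rack>,..."); this is not the actual Scylla option handling:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct ignored_node {
    std::string host_id, dc, rack;   // dc/rack stay empty in the plain host_id form
};

std::vector<ignored_node> parse_ignore_dead_nodes(const std::string& value) {
    std::vector<ignored_node> result;
    std::stringstream entries(value);
    std::string entry;
    while (std::getline(entries, entry, ',')) {   // entries are comma-separated
        std::stringstream fields(entry);
        ignored_node n;
        std::getline(fields, n.host_id, ':');
        std::getline(fields, n.dc, ':');          // optional extension
        std::getline(fields, n.rack, ':');        // optional extension
        result.push_back(n);
    }
    return result;
}

int main() {
    for (const auto& n : parse_ignore_dead_nodes(
            "8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1")) {
        std::cout << n.host_id << " -> " << n.dc << "/" << n.rack << "\n";
    }
}
```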

@bhalevy
Member Author

bhalevy commented Oct 23, 2023

But we must have the complete dc/rack information for token mapping in the steady state as well with the network-topology replication strategy; otherwise we'll access the wrong nodes, both for reads and, worse, for writes.

@asias
Contributor

asias commented Oct 23, 2023

> But we must have the complete dc/rack information for token mapping in the steady state as well with the network-topology replication strategy; otherwise we'll access the wrong nodes, both for reads and, worse, for writes.

Current code already loads dc and rack info during boot up.

...
tmptr->update_topology(ep, get_dc_rack(ep), locator::node::state::normal);
...
co_await _gossiper.add_saved_endpoint(ep);
...

Each node using the loaded dc and rack does not mean each node has to propagate such information on behalf of another node.

@bhalevy
Member Author

bhalevy commented Oct 23, 2023

> I have another idea:

> When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.

> --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2

> When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.

> No extra step of running an API to inject dc/rack/status is needed.

Isn't it tool late?
Maybe nodes should refuse to start if they observe nodes that are supposed to be normal token owners but no other node in the cluster was able to communicate with them.

In this case --ignore-dead-nodes could be applied on restart and then the dc and rack can either be loaded from system peers or be given by the config option (I like that part less).

@asias
Contributor

asias commented Oct 23, 2023

> I have another idea:
> When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.
> --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2
> When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.
> No extra step of running an API to inject dc/rack/status is needed.

> Isn't it tool late?

Why? It only extends the existing parameter.

> Maybe nodes should refuse to start if they observe nodes that are supposed to be normal token owners but no other node in the cluster was able to communicate with them.

What do you mean exactly? We need the first node to be able to boot after a full cluster down.

> In this case --ignore-dead-nodes could be applied on restart and then the dc and rack can either be loaded from system peers or be given by the config option (I like that part less).

Note, only the node that is running the replace op has the --ignore-dead-nodes-for-replace option. It cannot load the rack and dc from the system table because it is an empty node when it runs replace. However, the replacing node can choose to use the rack and dc info from the cluster (that is the common case; the cluster still has the full info for the dead node).

So, we can go one step further.

One can still use normally:

--ignore-dead-nodes-for-replace host_id_of_node_to_ignore

In case there was a full cluster down, we cannot get the DC / Rack / STATUS. We reject the replace op, asking the user to provide

--ignore-dead-nodes-for-replace host_id_of_node_to_ignore:rack:dc

so replace can continue.
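
A toy sketch of the check being proposed here (hypothetical names, not the actual storage_service validation): reject the replace when an ignored node's location and status are unknown through gossip and the operator did not supply them:

```cpp
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

struct ignored_node_state {
    std::optional<std::string> dc, rack, status;  // all unknown after a full cluster down
};

void validate_replace(const ignored_node_state& gossip_view,
                      const std::optional<std::string>& rack_from_option,
                      const std::optional<std::string>& dc_from_option) {
    bool known_via_gossip = gossip_view.dc && gossip_view.rack && gossip_view.status;
    bool supplied_by_admin = rack_from_option && dc_from_option;
    if (!known_via_gossip && !supplied_by_admin) {
        throw std::runtime_error(
            "ignored node has unknown DC/RACK/STATUS; "
            "use --ignore-dead-nodes-for-replace <host_id>:<rack>:<dc>");
    }
}

int main() {
    try {
        // Neither gossip nor the operator knows the dead node's location/status.
        validate_replace(ignored_node_state{}, std::nullopt, std::nullopt);
    } catch (const std::exception& e) {
        std::cout << "replace rejected: " << e.what() << "\n";
    }
}
```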

@bhalevy
Member Author

bhalevy commented Oct 23, 2023

> I have another idea:
> When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.
> --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2
> When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.
> No extra step of running an API to inject dc/rack/status is needed.

> Isn't it tool late?

> Why? It only extends the existing parameter.

I mean "too late", sorry.
In the sense that the problem is with initializing the endpoint state of the other nodes in the cluster, which happens before replace; hence it is too late to provide it at replace time.

> Maybe nodes should refuse to start if they observe nodes that are supposed to be normal token owners but no other node in the cluster was able to communicate with them.

> What do you mean exactly? We need the first node to be able to boot after a full cluster down.

Hmm, what I have in mind is all of them starting concurrently, but you're right that we need to support restarting them one at a time.

> In this case --ignore-dead-nodes could be applied on restart and then the dc and rack can either be loaded from system peers or be given by the config option (I like that part less).

> Note, only the node that is running the replace op has the --ignore-dead-nodes-for-replace option.

That's true today, but we can change that.
(We even have a FIXME for it:

// TODO: specify ignore_nodes for bootstrap
)

> It cannot load the rack and dc from the system table because it is an empty node when it runs replace. However, the replacing node can choose to use the rack and dc info from the cluster (that is the common case; the cluster still has the full info for the dead node).

But after full restart we don't load DC/RACK endpoint state without this patch, so other nodes don't propagate it to the replacing node.

> So, we can go one step further.

> One can still use normally:

> --ignore-dead-nodes-for-replace host_id_of_node_to_ignore

> In case there was a full cluster down, we cannot get the DC / Rack / STATUS. We reject the replace op, asking the user to provide

> --ignore-dead-nodes-for-replace host_id_of_node_to_ignore:rack:dc

> so replace can continue.

If those nodes are dead, all other nodes need to reflect that in an official way and they all need to agree about their state.
If we manually tell the other nodes that we restart after a full cluster shutdown that some nodes are dead, with or without specifying their dc/rack location (which can be retrieved from system.peers; no need to extend the flag for that IMO), they can initialize the endpoint_state similarly to what is proposed in the patch. Then the replacing node can be passed only --ignore-dead-nodes-for-replace, as we do today, and it can get their endpoint_state exactly like it does today.

@asias
Contributor

asias commented Oct 24, 2023

> I have another idea:
> When one uses the ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.
> --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c:dc1:rack1,125ed9f4-7777-1dbn-mac8-43fddce9123e:dc2:rack2
> When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes.
> No extra step of running an API to inject dc/rack/status is needed.

> Isn't it tool late?

> Why? It only extends the existing parameter.

> I mean "too late", sorry. In the sense that the problem is with initializing the endpoint state of the other nodes in the cluster, which happens before replace; hence it is too late to provide it at replace time.

It is not too late, because existing nodes load both tokens and rack/dc info when they restart. The replacing node is the only one that does not know the rack and dc info.

> Maybe nodes should refuse to start if they observe nodes that are supposed to be normal token owners but no other node in the cluster was able to communicate with them.

> What do you mean exactly? We need the first node to be able to boot after a full cluster down.

> Hmm, what I have in mind is all of them starting concurrently, but you're right that we need to support restarting them one at a time.

> In this case --ignore-dead-nodes could be applied on restart and then the dc and rack can either be loaded from system peers or be given by the config option (I like that part less).

> Note, only the node that is running the replace op has the --ignore-dead-nodes-for-replace option.

> That's true today, but we can change that. (We even have a FIXME for it:

> // TODO: specify ignore_nodes for bootstrap

The FIXME is about adding support for --ignore-dead-nodes-for-bootstrap. We can do a similar thing for a bootstrapping node too.

By "only the node", I meant we do not pass the option to the existing nodes in the cluster; we only add this option to the replacing node. If we add --ignore-dead-nodes-for-bootstrap, we add the option to the bootstrapping node too. But in both cases, they are new nodes.

> It cannot load the rack and dc from the system table because it is an empty node when it runs replace. However, the replacing node can choose to use the rack and dc info from the cluster (that is the common case; the cluster still has the full info for the dead node).

> But after full restart we don't load DC/RACK endpoint state without this patch, so other nodes don't propagate it to the replacing node.

Yes. That's why I proposed two solutions above:

  1. Use an explicit REST API to inject the dc/rack/status info for the nodes that are down.

  2. Specify the dc and rack info using --ignore-dead-nodes-for-bootstrap node_host_id:node_rack:node_dc.

It is best to avoid having each node inject the status for other nodes on restart. It is not the source of truth.

> So, we can go one step further.
> One can still use normally:
> --ignore-dead-nodes-for-replace host_id_of_node_to_ignore
> In case there was a full cluster down, we cannot get the DC / Rack / STATUS. We reject the replace op, asking the user to provide
> --ignore-dead-nodes-for-replace host_id_of_node_to_ignore:rack:dc
> so replace can continue.

> If those nodes are dead, all other nodes need to reflect that in an official way and they all need to agree about their state. If we manually tell the other nodes that we restart after a full cluster shutdown that some nodes are dead, with or without specifying their dc/rack location (which can be retrieved from system.peers; no need to extend the flag for that IMO), they can initialize the endpoint_state similarly to what is proposed in the patch. Then the replacing node can be passed only --ignore-dead-nodes-for-replace, as we do today, and it can get their endpoint_state exactly like it does today.

The above two solutions could be the "official way" to make the cluster agree about the down nodes' states.

Let's agree it is better to have admin intervention when such a rare situation happens (multiple nodes down + a full cluster down). Actually, the admin must intervene because the admin needs to run the replace procedure anyway. We can piggyback "telling the cluster about the down nodes" on the replace op in one step, using --ignore-dead-nodes-for-bootstrap node_host_id:node_rack:node_dc.

@bhalevy
Member Author

bhalevy commented Oct 24, 2023

Even if we have such an interface we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others.
In this case we still need to make sure that we have the right protections to be able to serve CQL using the remaining nodes, using the correct replication map, and we're able to refuse unsafe node operations that don't have the ignore nodes option, and serve them correctly with the ignore nodes option.

To me, it seems much safer to add an API to officially declare nodes as DEAD and mark them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace operation.

This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an ignore_nodes flag.

This method can be applied also in the common case of a single node going down in normal operation mode.
Removenode and replace node can use the above path to mark it as permanently DEAD on all living nodes even before they apply any change to the token metadata, if more nodes are lost during e.g. replace.
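
A toy illustration of the "officially declare nodes DEAD" idea (purely hypothetical; no such registry or API exists today): a persisted set of host_ids that every node would consult before letting an endpoint rejoin, and that removenode/replace would populate automatically:

```cpp
#include <iostream>
#include <set>
#include <string>

struct dead_nodes_registry {
    std::set<std::string> dead;   // would be persisted in a system table in practice

    void declare_dead(const std::string& host_id) { dead.insert(host_id); }

    // A node marked DEAD may only leave via removenode/replace, never rejoin.
    bool may_rejoin(const std::string& host_id) const { return !dead.count(host_id); }
};

int main() {
    dead_nodes_registry reg;
    reg.declare_dead("8d5ed9f4-7764-4dbd-bad8-43fddce94b7c");
    std::cout << std::boolalpha
              << reg.may_rejoin("8d5ed9f4-7764-4dbd-bad8-43fddce94b7c") << "\n";  // false
}
```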

@asias
Contributor

asias commented Oct 24, 2023

> Even if we have such an interface we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others. In this case we still need to make sure that we have the right protections to be able to serve CQL using the remaining nodes, using the correct replication map, and we're able to refuse unsafe node operations that don't have the ignore nodes option, and serve them correctly with the ignore nodes option.

Currently, if there are nodes down and there was no topology change, each node will use the correct replication map no matter what (e.g., with/without restart / full cluster down).

> To me, it seems much safer to add an API to officially declare nodes as DEAD and mark them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace operation.

> This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an ignore_nodes flag.

> This method can be applied also in the common case of a single node going down in normal operation mode. Removenode and replace node can use the above path to mark it as permanently DEAD on all living nodes even before they apply any change to the token metadata, if more nodes are lost during e.g. replace.

There is no such state called DEAD currently. A node can be UP or DOWN (by gossip). A node can be in the ring or removed from the ring. I think what you actually meant is to isolate the node, so that the node could not join back to the cluster.

The node should not come back only after the new replace has started; it can come back with no problem if there is no replace op. This makes it a perfect fit for the replacing node to trigger the isolation of the node, rather than an extra API call, because the admin can very easily forget to do so.

@bhalevy
Member Author

bhalevy commented Oct 24, 2023

> Even if we have such an interface we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others. In this case we still need to make sure that we have the right protections to be able to serve CQL using the remaining nodes, using the correct replication map, and we're able to refuse unsafe node operations that don't have the ignore nodes option, and serve them correctly with the ignore nodes option.

> Currently, if there are nodes down and there was no topology change, each node will use the correct replication map no matter what (e.g., with/without restart / full cluster down).

But the endpoint_state in gossip still seems broken.

> To me, it seems much safer to add an API to officially declare nodes as DEAD and mark them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace operation.

> This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an ignore_nodes flag.
> This method can be applied also in the common case of a single node going down in normal operation mode. Removenode and replace node can use the above path to mark it as permanently DEAD on all living nodes even before they apply any change to the token metadata, if more nodes are lost during e.g. replace.

> There is no such state called DEAD currently. A node can be UP or DOWN (by gossip). A node can be in the ring or removed from the ring. I think what you actually meant is to isolate the node, so that the node could not join back to the cluster.

Yes (in the past I proposed adding a system.quarantine table, IIRC. That should be taken care of by raft-based consistent topology changes AFAIK, but I'm looking for a change that could be backported to older versions).

> The node should not come back only after the new replace has started; it can come back with no problem if there is no replace op. This makes it a perfect fit for the replacing node to trigger the isolation of the node, rather than an extra API call, because the admin can very easily forget to do so.

I agree it should be done automatically by replace (and remove node).

So what you're suggesting is to add dc and rack to --ignore-dead-nodes-for-replace to populate the local endpoint_state on the replacing node as well as to build a correct replication map (including token_metadata and locator::topology)
instead of relying on gossip to get those application states?

Would we be able to map the host_id in the option to an ip address without the endpoint_state in gossip?
I guess we'll need to add the ip address too if the replacing node knows nothing about the dead nodes as the replacing node has no persistent local state and other nodes may have no sharable endpoint_state of the dead nodes after full-cluster restart.

@asias
Contributor

asias commented Oct 24, 2023

> Even if we have such an interface we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others. In this case we still need to make sure that we have the right protections to be able to serve CQL using the remaining nodes, using the correct replication map, and we're able to refuse unsafe node operations that don't have the ignore nodes option, and serve them correctly with the ignore nodes option.

> Currently, if there are nodes down and there was no topology change, each node will use the correct replication map no matter what (e.g., with/without restart / full cluster down).

> But the endpoint_state in gossip still seems broken.

> To me, it seems much safer to add an API to officially declare nodes as DEAD and mark them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace operation.

> This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an ignore_nodes flag.
> This method can be applied also in the common case of a single node going down in normal operation mode. Removenode and replace node can use the above path to mark it as permanently DEAD on all living nodes even before they apply any change to the token metadata, if more nodes are lost during e.g. replace.

> There is no such state called DEAD currently. A node can be UP or DOWN (by gossip). A node can be in the ring or removed from the ring. I think what you actually meant is to isolate the node, so that the node could not join back to the cluster.

> Yes (in the past I proposed adding a system.quarantine table, IIRC. That should be taken care of by raft-based consistent topology changes AFAIK, but I'm looking for a change that could be backported to older versions).

> The node should not come back only after the new replace has started; it can come back with no problem if there is no replace op. This makes it a perfect fit for the replacing node to trigger the isolation of the node, rather than an extra API call, because the admin can very easily forget to do so.

> I agree it should be done automatically by replace (and remove node).

> So what you're suggesting is to add dc and rack to --ignore-dead-nodes-for-replace to populate the local endpoint_state on the replacing node as well as to build a correct replication map (including token_metadata and locator::topology) instead of relying on gossip to get those application states?

Yes. Do not rely on gossip from the peer nodes to inject the rack and dc info. But the replacing node can inject rack/dc/status=normal into gossip, so we can reuse the existing code to set up the correct token_metadata and locator::topology for the dead nodes.

One thing I do not like in this solution is that the admin can make mistakes when providing the rack and dc. I guess the chance is low, though.

> Would we be able to map the host_id in the option to an ip address without the endpoint_state in gossip? I guess we'll need to add the ip address too if the replacing node knows nothing about the dead nodes as the replacing node has no persistent local state and other nodes may have no sharable endpoint_state of the dead nodes after full-cluster restart.

With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

@bhalevy
Member Author

bhalevy commented Oct 24, 2023

> With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state.

Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?

@asias
Contributor

asias commented Oct 24, 2023

> With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

> First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state.

The endpoint_state will be removed only if the node is a gossip-only member, which means it is not part of the ring. Since we inject status=normal on the replacing node for dead nodes, they will be part of the ring and will not be evicted.

> Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?

We could do that. But this requires the other nodes to inject the dc and rack. I wish we could inject as few states as possible. The host_id is injected so that removenode can work after a full cluster down. Note, we only need to provide the rack/dc after a full cluster down, as I mentioned here: #15787 (comment).

@bhalevy
Member Author

bhalevy commented Oct 24, 2023

> With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

> First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state.

> The endpoint_state will be removed only if the node is a gossip-only member, which means it is not part of the ring. Since we inject status=normal on the replacing node for dead nodes, they will be part of the ring and will not be evicted.

> Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?

> We could do that. But this requires the other nodes to inject the dc and rack. I wish we could inject as few states as possible. The host_id is injected so that removenode can work after a full cluster down. Note, we only need to provide the rack/dc after a full cluster down, as I mentioned here: #15787 (comment).

But if those dead nodes are present in system.peers on the living nodes including HOST_ID, DC, RACK, and TOKENS, why can't we rely on that? We're not injecting those states out of thin air.

@asias
Contributor

asias commented Oct 24, 2023

> With current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.

> First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state.

> The endpoint_state will be removed only if the node is a gossip-only member, which means it is not part of the ring. Since we inject status=normal on the replacing node for dead nodes, they will be part of the ring and will not be evicted.

> Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?

> We could do that. But this requires the other nodes to inject the dc and rack. I wish we could inject as few states as possible. The host_id is injected so that removenode can work after a full cluster down. Note, we only need to provide the rack/dc after a full cluster down, as I mentioned here: #15787 (comment).

> But if those dead nodes are present in system.peers on the living nodes including HOST_ID, DC, RACK, and TOKENS, why can't we rely on that? We're not injecting those states out of thin air.

We already inject host_id and tokens. We are safe to inject dc and rack, and I do not have a strong opinion against injecting them. I thought about it as a way to validate the dc and rack that are provided by the admin. The gain is limited.

So, anyway, in summary:

I think we can inject dc and rack during boot-up; it is safe to do so. We let the replacing node inject status=normal for the dead nodes only in case we do not have the status for them after a full cluster down. We should do the injection very carefully.
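
Putting this summary together in the same kind of toy model (simplified, illustrative names; the real call sites differ): every node seeds DC/RACK for its peers from its own system.peers on boot, and only the replacing node fills in a missing STATUS, and only for nodes it was explicitly told to ignore:

```cpp
#include <map>
#include <optional>
#include <string>

struct node_info {
    std::string dc, rack;
    std::optional<std::string> status;  // unknown after a full cluster restart
};

using cluster_view = std::map<std::string, node_info>;  // keyed by host_id

void seed_from_local_peers(cluster_view& view, const std::string& host_id,
                           const std::string& dc, const std::string& rack) {
    view[host_id].dc = dc;    // safe: comes from this node's own system.peers
    view[host_id].rack = rack;
}

void replacing_node_fixup(cluster_view& view, const std::string& ignored_host_id) {
    auto& n = view[ignored_host_id];
    if (!n.status) {
        n.status = "normal";  // only for explicitly ignored dead nodes with no status
    }
}

int main() {
    cluster_view view;
    seed_from_local_peers(view, "8d5ed9f4-7764-4dbd-bad8-43fddce94b7c", "dc1", "rack1");
    replacing_node_fixup(view, "8d5ed9f4-7764-4dbd-bad8-43fddce94b7c");
}
```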

@mykaul added the Field-Tier1 (Urgent issues as per FieldEngineering request) label Nov 5, 2023
bhalevy added a commit to bhalevy/scylla that referenced this issue Nov 6, 2023
When a node bootstraps or replaces a node after full cluster
shutdown and restart, some nodes may be down.

Existing nodes in the cluster load the down nodes' TOKENS
(and recently, in this series, also DC and RACK) from system.peers
and on that path, also populate locator::topology and token_metadata
accordingly with the down nodes' tokens.

However, a bootstrapping node has no persistent knowledge of the down nodes,
and it learns about their existence from gossip.
But, since the down nodes have unknown status, they never go
through `handle_state_normal`.

This change updates the down nodes state in topology
and token_metadata as if they were loaded from system tables
in `join_cluster`.

Fixes scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Dec 11, 2023
Load saved endpoints with an intermediate `LOADED` status
that is not `UNKNOWN` but not `NORMAL` yet either.

It will be used in a later patch to populate
bootstrapping nodes with the state of DOWN nodes
as the bootstrapping nodes have no system.peers
table locally and they have no other way of knowing
about DOWN nodes after a full cluster restart.

Refs scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Dec 11, 2023
When loading endpoint_state from system.peers, pass the loaded nodes' dc/rack
info from storage_service::join_token_ring to gossiper::add_saved_endpoint.

Load the endpoint DC/RACK information into the endpoint_state,
if available, so it can propagate to bootstrapping nodes
via gossip, even if those nodes are DOWN after a full cluster-restart.

Refs scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
bhalevy added a commit to bhalevy/scylla that referenced this issue Dec 11, 2023
When a node bootstraps or replaces a node after full cluster
shutdown and restart, some nodes may be down.

Existing nodes in the cluster load the down nodes' TOKENS
(and recently, in this series, also DC and RACK) from system.peers
and then populate locator::topology and token_metadata
accordingly with the down nodes' tokens in storage_service::join_cluster.

However, a bootstrapping/replacing node has no persistent knowledge
of the down nodes, and it learns about their existence only from gossip.
But since the down nodes have unknown status, they never go
through `handle_state_normal` and therefore they are not accounted
as normal token owners.

This patch updates the ignored nodes (for replace) state in topology
and token_metadata as if they were loaded from system tables
in `prepare_replacement_info` based on the endpoint_state retrieved
in the shadow round initiated in prepare_replacement_info.

Fixes scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@tgrabiec
Contributor

This looks like another incarnation of #6088.

Why not restore STATUS, so that the ring is consistent for all the ops? I don't like fixing up the ring from replace; it works around a bug in state propagation, piling up even more complexity.

What about removenode, doesn't it also need accurate ring info?

bhalevy added a commit to bhalevy/scylla that referenced this issue Feb 5, 2024
When loading endpoint_state from system.peers,
pass the loaded nodes' dc/rack info from
storage_service::join_token_ring to gossiper::add_saved_endpoint.

Load the endpoint DC/RACK information into the endpoint_state,
if available, so it can propagate to bootstrapping nodes
via gossip, even if those nodes are DOWN after a full cluster-restart.

Note that this change makes the host_id presence
mandatory following scylladb#16376.
The reason to do so is that the other states (tokens, dc, and rack)
are useless without the host_id.
This change is backward compatible since the HOST_ID application state
has been written to system.peers since Scylla's inception,
and it would be missing only due to a potential exception
in older versions that failed to write it.
In this case, manual intervention is needed and
the correct HOST_ID needs to be manually updated in system.peers.

Refs scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@mykaul added this to the 6.1 milestone Apr 3, 2024
kbr-scylla added a commit that referenced this issue Apr 16, 2024
…load the ignored nodes state for replace' from Benny Halevy

The problem this series solves is correctly ignoring DOWN nodes state
when replacing a node.

When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.

Since the replacing node is bootstrapping it starts with an empty
system.peers table so it has no notion about any node state and it
learns about all other nodes via gossip shadow round done in
`storage_service::prepare_replacement_info`.

Normally, since the DOWN nodes to ignore already joined the ring, the
remaining nodes will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state from them,
which is loaded from system.peers.

Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.

The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.

But still, without a `STATUS` those nodes are not considered as normal
token owners yet, and they do not go through handle_state_normal which
adds them to the topology and token_metadata.

The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`.  If there are more DOWN nodes that are not
explicitly ignored replace will fail (as it should).

Fixes #15787

Closes #15788

* github.com:scylladb/scylladb:
  storage_service: join_token_ring: load ignored nodes state if replacing
  storage_service: replacement_info: return ignore_nodes state
  locator: host_id_or_endpoint: keep value as variant
  gms: endpoint_state: add getters for host_id, dc_rack, and tokens
  storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
  gossiper: add_saved_endpoint: set dc and rack
  gossiper: add_saved_endpoint: fixup indentation
  gossiper: add_saved_endpoint: make host_id mandatory
  gossiper: add load_endpoint_state
  gossiper: start_gossiping: log local state
mergify bot pushed a commit that referenced this issue Apr 16, 2024
When a node bootstraps or replaces a node after full cluster
shutdown and restart, some nodes may be down.

Existing nodes in the cluster load the down nodes' TOKENS
(and recently, in this series, also DC and RACK) from system.peers
and then populate locator::topology and token_metadata
accordingly with the down nodes' tokens in storage_service::join_cluster.

However, a bootstrapping/replacing node has no persistent knowledge
of the down nodes, and it learns about their existence only from gossip.
But since the down nodes have unknown status, they never go
through `handle_state_normal` (in gossiper mode) and therefore
they are not accounted as normal token owners.
This is handled by `topology_state_load`, but not with
gossip-based node operations.

This patch updates the ignored nodes (for replace) state in topology
and token_metadata as if they were loaded from system tables,
after calling `prepare_replacement_info` when raft topology changes are
disabled, based on the endpoint_state retrieved in the shadow round
initiated in prepare_replacement_info.

Fixes #15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 655d624)

# Conflicts:
#	service/storage_service.cc