Nodes location and status get lost after full cluster restart with DOWN nodes #15787
Comments
Adding DC/RACK from system.peers and initializing STATUS=shutdown empirically proved to be safe, as this is the best a node knows about the down nodes. Note that in split-brain situations, other nodes may have stale/wrong information, but we don't have a good way to prevent this state from "infecting" other nodes that restarted and initialized the endpoint state as described above.
When loading endpoint_state from system.peers, pass the loaded nodes' dc/rack info from storage_service::join_token_ring to gossiper::add_saved_endpoint. Load the endpoint DC/RACK information into the endpoint_state, if available, and initialize the endpoint_state STATUS to `shutdown` if it owns tokens, as this is the best we know about the endpoint at this stage. Fixes scylladb#15787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
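The behavior described in this commit message can be modeled with a short sketch. This is a minimal Python illustration, not Scylla's actual C++ `add_saved_endpoint` implementation; the dictionary keys and function shape are assumptions for clarity only:

```python
# Illustrative model of the add_saved_endpoint behavior described above:
# restore DC/RACK from the persisted system.peers row when present, and
# initialize STATUS to "shutdown" only for endpoints that own tokens.

def add_saved_endpoint(endpoint_state: dict, peer_row: dict) -> dict:
    """Merge state loaded from a system.peers row into an endpoint_state."""
    state = dict(endpoint_state)
    # DC/RACK are loaded only when present in the saved row.
    for key in ("data_center", "rack"):
        if peer_row.get(key) is not None:
            state[key.upper()] = peer_row[key]
    # "shutdown" is the best guess for a token owner we have not heard from.
    if peer_row.get("tokens"):
        state["STATUS"] = "shutdown"
        state["TOKENS"] = set(peer_row["tokens"])
    return state

saved = {"data_center": "dc1", "rack": "r1", "tokens": [123, 456]}
state = add_saved_endpoint({}, saved)
print(state["STATUS"])  # shutdown
```

A node with no saved tokens stays without a STATUS, which matches the "gossip-only member" case discussed later in the thread.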
It is not safe to inject STATUS shutdown for other nodes. Injecting it is worse than what it tries to solve. https://github.com/scylladb/scylladb/pull/15788/files/e9e787868e5460f42731228dd6daaeb61add6ace#r1368045304 What we can do is prevent the replace ops from proceeding if a node with unknown status is seen; we do the same check for bootstrapping a node. To recover a cluster from a full cluster down plus lost nodes, which should happen rarely, we can introduce an API to inject the status for the lost nodes explicitly. This way, we can both recover a messed-up cluster and be safe at the same time. For rack and dc info, we can load it during boot unconditionally and use it on the node that loads it, so network topology strategy would work. If each node loads it, it is not necessary to inject it for other nodes.
I have another idea: when one uses ignore_dead_nodes_for_replace, we require the user to provide the dc and rack info in addition to the ip or host_id.
When it is provided, the replacing node injects the dc/rack/status=shutdown for those ignored nodes. No extra step of running an API to inject dc/rack/status is needed.
But we must have the complete dc/rack information for token mapping in the steady state as well with the network-topology replication strategy; otherwise we'll access the wrong nodes, both for reads and, worse, for writes.
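The point above can be made concrete with a small sketch. This is a deliberately simplified model, not the actual NetworkTopologyStrategy replica-selection algorithm: it only shows that when replicas are chosen per data center, a node whose DC is unknown is invisible to that DC's selection, so the replica set comes out wrong:

```python
# Simplified model of per-DC replica selection: a node with a lost/unknown
# DC never lands in any DC's candidate list, so it is silently skipped.
from collections import defaultdict

def replicas_per_dc(nodes: dict, rf_per_dc: dict) -> dict:
    """nodes: host -> dc (None if unknown); rf_per_dc: dc -> replication factor."""
    by_dc = defaultdict(list)
    for host, dc in sorted(nodes.items()):
        by_dc[dc].append(host)
    # Pick up to rf candidates from each DC's known members.
    return {dc: by_dc[dc][:rf] for dc, rf in rf_per_dc.items()}

nodes = {"n1": "dc1", "n2": "dc1", "n3": None}  # n3's DC info was lost
print(replicas_per_dc(nodes, {"dc1": 3}))  # {'dc1': ['n1', 'n2']} -- n3 missing
```

With only two of three dc1 replicas visible, reads may miss data and writes go to the wrong owner set, which is the failure mode described above.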
Current code already loads dc and rack info during boot up.
Each node using the loaded dc and rack does not mean each node has to propagate such information on behalf of another node.
Isn't it too late? In this case --ignore-dead-nodes could be applied on restart, and then the dc and rack can either be loaded from system.peers or be given by the config option (I like that part less).
Why? It only extends the existing parameter.
What do you mean exactly? We need the first node to be able to boot after a full cluster down.
Note, only the node that is running the replace op has the option --ignore-dead-nodes-for-replace. It cannot load the rack and dc from the system table because it is an empty node when it runs replace. However, the replacing node can choose to use the rack and dc info from the cluster (that is the common case; the cluster still has the full info for the dead node). So, we can go one step further. One can still use normally: --ignore-dead-nodes-for-replace host_id_of_node_to_ignore. In case there was a full cluster down, we cannot get the DC / RACK / STATUS. We reject the replace ops, asking the user to provide --ignore-dead-nodes-for-replace host_id_of_node_to_ignore:rack:dc so replace can continue.
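The extended option format proposed above admits a straightforward parser. This is a hedged sketch of the proposal only; the function name and error text are illustrative, and this is not code from the Scylla tree:

```python
# Sketch of parsing the proposed extended --ignore-dead-nodes-for-replace
# value: "host_id" in the common case, or "host_id:rack:dc" after a full
# cluster down when the cluster no longer knows the dead node's topology.

def parse_ignore_dead_node(value: str) -> dict:
    parts = value.split(":")
    if len(parts) == 1:
        # Common case: the cluster still knows the dead node's rack/dc.
        return {"host_id": parts[0], "rack": None, "dc": None}
    if len(parts) == 3:
        # Full-cluster-down case: admin supplies rack and dc explicitly.
        return {"host_id": parts[0], "rack": parts[1], "dc": parts[2]}
    raise ValueError("expected host_id or host_id:rack:dc")

print(parse_ignore_dead_node("a3f1b2c4:rack1:dc1"))
```

When only the host_id form is given and no status is known for that node, replace would be rejected, matching the comment above.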
I mean "too late", sorry.
Hmm, what I have in mind is all of them starting concurrently, but you're right that we need to support restarting them one at a time.
That's true today, but we can change that. scylladb/service/storage_service.cc Line 2933 in f181ac0
But after full restart we don't load DC/RACK endpoint state without this patch, so other nodes don't propagate it to the replacing node.
If those nodes are dead, all other nodes need to reflect that in an official way, and they all need to agree about their state.
It is not too late, because existing nodes load both tokens and rack/dc info when they restart. The replacing node is the only one that does not know the rack and dc info.
The FIXME is about adding support for --ignore-dead-nodes-for-bootstrap. We can do similar things for a bootstrapping node too. By "only the node", I meant we do not specify the option on the existing nodes in the cluster; we only add this option to the replacing node. If we add --ignore-dead-nodes-for-bootstrap, we add the option to the bootstrapping node too. But in both cases, they are new nodes.
Yes. That's why I proposed the two solutions above.
It is best to avoid having each node inject the status for other nodes on restart. It is not the source of truth.
The above two solutions could be the "official way" to make the cluster agree about the down nodes' states. Let's agree it is better to have admin intervention when such a rare situation happens (multiple nodes down + full cluster down). Actually, the admin must intervene, because the admin needs to run the replace procedure anyway. We can piggyback "tell the down nodes' info" onto the replace ops in one step using --ignore-dead-nodes-for-bootstrap node_host_id:node_rack:node_dc.
Even if we have such an interface, we need to make sure that the cluster still gets into a stable state before the interface is applied, since the admin may notice that some nodes are down only after restarting all the others. To me, it seems much safer to add an API to officially declare nodes as DEAD, marking them as such persistently in the system tables, as well as representing them as such in gossip (and locator::topology), because they are not coming back and must not be allowed back into the cluster. They can only go away using a removenode or replace node operation. This way a replacing node can learn about the dead nodes from the rest of the cluster, as it learns about all live nodes, and all the nodes in the cluster can agree about it, without the need of an extra option. This method can be applied also in the common case of a single node going down in normal operation mode.
Currently, if there are nodes down and there was no topology change, each node will use the correct replication map no matter what (e.g., with/without restart / full cluster down).
There is no such state called DEAD currently. A node can be UP or DOWN (by gossip). A node can be in the ring or removed from the ring. I think what you actually meant is to isolate the node, so that it cannot join back into the cluster. The node must not come back only after the new replace has started; it can come back with no problem if there is no replace op. This makes it a perfect fit for the replacing node to trigger the isolation of the node, rather than an extra API call to do so, because the admin can forget to make that call very easily.
But the endpoint_state in gossip still seems broken.
Yes (in the past I proposed to add a system.quarantine table, IIRC). That should be taken care of with raft-based consistent topology changes AFAIK, but I'm looking for a change that could be backported to older versions.
I agree it should be done automatically by replace (and removenode). So what you're suggesting is to add dc and rack to --ignore-dead-nodes-for-replace. Would we be able to map the host_id in the option to an ip address without the endpoint_state in gossip?
Yes. We do not rely on gossip from the peer nodes to inject the rack and dc info, but the replacing node can inject rack/dc/status=normal into gossip, so we can reuse the existing code to set up the correct token_metadata and locator::topology for the dead nodes. One thing I do not like in this solution is that the admin can make mistakes when providing the rack and dc. I guess the chance is low though.
With the current code, the replacing node will know the host_id and ip of the dead nodes from gossip. So we can learn the ip address from gossip.
First, we saw that the endpoint_state is evicted on the other nodes after some time, presumably since it has no STATUS state. Second, if we ensure that the other nodes can provide the endpoint's HOST_ID, why not just get its DC and RACK from gossip as well?
The endpoint_state will be removed only if the node is a gossip-only member, which means it is not part of the ring. Since we inject status=normal on the replacing node for the dead nodes, they will be part of the ring and will not be evicted.
We could do that. But this requires the other nodes to inject the dc and rack. I wish we could inject as few states as possible. The host_id is injected so that removenode can work after a full cluster down. Note, we only need to provide the rack/dc after a full cluster down, as I mentioned here: #15787 (comment).
But if those dead nodes are present in system.peers on the living nodes, including HOST_ID, DC, RACK, and TOKENS, why can't we rely on that? We're not injecting those states out of thin air.
We already inject host_id and tokens, so we are safe to inject dc and rack too. I do not have a strong opinion against injecting dc and rack; I thought about it as a way to validate the dc and rack that are provided by the admin, but the gain is limited. So, in summary: I think we can inject dc and rack during boot up; it is safe to do so. We let the replacing node inject status=normal for the dead nodes, only in case we do not have the status for them after a full cluster down. We should do the injection very carefully.
When a node bootstraps or replaces a node after a full cluster shutdown and restart, some nodes may be down. Existing nodes in the cluster load the down nodes' TOKENS (and recently, in this series, also DC and RACK) from system.peers and on that path, also populate locator::topology and token_metadata accordingly with the down nodes' tokens. However, a bootstrapping node has no persistent knowledge of the down nodes, and it learns about their existence from gossip. But since the down nodes have unknown status, they never go through `handle_state_normal`. This change updates the down nodes' state in topology and token_metadata as if they were loaded from system tables in `join_cluster`. Fixes scylladb#15787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Load saved endpoints with an intermediate `LOADED` status that is not `UNKNOWN`, but not `NORMAL` yet either. It will be used in a later patch to populate bootstrapping nodes with the state of DOWN nodes, as bootstrapping nodes have no local system.peers table and have no other way of knowing about DOWN nodes after a full cluster restart. Refs scylladb#15787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
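The intermediate status idea can be sketched as a small state model. This is an illustrative Python sketch of the concept only; the enum members and helper name are assumptions, not Scylla's actual gossiper code:

```python
# Sketch of the intermediate LOADED status: endpoints restored from
# system.peers are distinguishable both from never-seen nodes (UNKNOWN)
# and from nodes confirmed live via gossip (NORMAL).
from enum import Enum, auto

class Status(Enum):
    UNKNOWN = auto()
    LOADED = auto()    # restored from system.peers, not yet confirmed via gossip
    NORMAL = auto()
    SHUTDOWN = auto()

def status_for_saved_endpoint(has_tokens: bool) -> Status:
    # A saved token owner gets LOADED rather than staying UNKNOWN;
    # a gossip-only member stays UNKNOWN.
    return Status.LOADED if has_tokens else Status.UNKNOWN
```

The point of the extra state is that a later patch can treat LOADED nodes as token owners for topology purposes without claiming they are NORMAL.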
When loading endpoint_state from system.peers, pass the loaded nodes' dc/rack info from storage_service::join_token_ring to gossiper::add_saved_endpoint. Load the endpoint DC/RACK information into the endpoint_state, if available, so it can propagate to bootstrapping nodes via gossip, even if those nodes are DOWN after a full cluster restart. Refs scylladb#15787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a node bootstraps or replaces a node after a full cluster shutdown and restart, some nodes may be down. Existing nodes in the cluster load the down nodes' TOKENS (and recently, in this series, also DC and RACK) from system.peers and then populate locator::topology and token_metadata accordingly with the down nodes' tokens in storage_service::join_cluster. However, a bootstrapping/replacing node has no persistent knowledge of the down nodes, and it learns about their existence only from gossip. But since the down nodes have unknown status, they never go through `handle_state_normal` and therefore they are not accounted as normal token owners. This patch updates the ignored nodes' (for replace) state in topology and token_metadata as if they were loaded from system tables, based on the endpoint_state retrieved in the shadow round initiated in `prepare_replacement_info`. Fixes scylladb#15787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
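The core of this patch can be modeled in a few lines. This is a hedged Python sketch of the idea, with hypothetical names and flat dictionaries standing in for locator::topology and token_metadata; it is not the actual C++ change:

```python
# Illustrative model: after the shadow round, fold each ignored node's
# endpoint_state into topology (dc/rack) and token_metadata (token -> owner)
# as if it had been loaded from the local system tables.

def apply_ignored_nodes(ignored: dict, topology: dict, token_metadata: dict) -> None:
    """ignored: host_id -> {"dc": str, "rack": str, "tokens": set}"""
    for host_id, st in ignored.items():
        topology[host_id] = {"dc": st["dc"], "rack": st["rack"]}
        for token in st["tokens"]:
            # Count the down node as a normal token owner.
            token_metadata[token] = host_id

topo, tm = {}, {}
apply_ignored_nodes({"h1": {"dc": "dc1", "rack": "r1", "tokens": {7, 42}}}, topo, tm)
print(tm[7], topo["h1"]["dc"])  # h1 dc1
```

Only explicitly ignored nodes are added this way; any other DOWN node still has unknown status, so replace fails, as the series summary notes.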
This looks like another incarnation of #6088. Why not restore STATUS, so that the ring is consistent for all the ops? I don't like fixing up the ring from replace; it works around a bug in state propagation, piling up even more complexity. What about removenode, doesn't it also need accurate ring info?
When loading endpoint_state from system.peers, pass the loaded nodes' dc/rack info from storage_service::join_token_ring to gossiper::add_saved_endpoint. Load the endpoint DC/RACK information into the endpoint_state, if available, so it can propagate to bootstrapping nodes via gossip, even if those nodes are DOWN after a full cluster restart. Note that this change makes the host_id presence mandatory, following scylladb#16376. The reason to do so is that the other states (tokens, dc, and rack) are useless without the host_id. This change is backward compatible, since the HOST_ID application state has been written to system.peers since Scylla's inception, and it would be missing only due to a potential exception in older versions that failed to write it. In that case, manual intervention is needed and the correct HOST_ID needs to be manually updated in system.peers. Refs scylladb#15787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
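The host_id-mandatory rule reads naturally as a precondition on loading. The following is an illustrative Python sketch of that rule only; the function name, row keys, and error text are assumptions, not the actual gossiper::load_endpoint_state code:

```python
# Sketch of the rule above: without HOST_ID, the remaining saved states
# (tokens, dc, rack) cannot be attributed to a node, so loading fails and
# the operator must repair system.peers manually.

def load_endpoint_state(peer_row: dict) -> dict:
    host_id = peer_row.get("host_id")
    if host_id is None:
        raise RuntimeError("missing HOST_ID in system.peers row; "
                           "manual intervention required")
    return {
        "HOST_ID": host_id,
        "TOKENS": set(peer_row.get("tokens", ())),
        "DC": peer_row.get("data_center"),
        "RACK": peer_row.get("rack"),
    }
```

Failing loudly here is the backward-compatibility trade-off the commit message describes: rows written by healthy older versions always carry HOST_ID, so only already-corrupt rows are rejected.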
…load the ignored nodes state for replace' from Benny Halevy The problem this series solves is correctly ignoring DOWN nodes' state when replacing a node. When a node is replaced and there are other nodes that are down, the replacing node is told to ignore those DOWN nodes using the `ignore_dead_nodes_for_replace` option. Since the replacing node is bootstrapping, it starts with an empty system.peers table, so it has no notion of any node state, and it learns about all other nodes via the gossip shadow round done in `storage_service::prepare_replacement_info`. Normally, since the DOWN nodes to ignore already joined the ring, the remaining nodes will have their endpoint state already in gossip, but if the whole cluster was restarted while those DOWN nodes did not start, the remaining nodes will only have a partial endpoint state for them, which is loaded from system.peers. Currently, the partial endpoint state contains only `HOST_ID` and `TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`. The first part of this series also loads `DC` and `RACK` from system.peers to make them available to the replacing node, as they are crucial for building a correct replication map with the network topology replication strategy. But still, without a `STATUS`, those nodes are not considered normal token owners yet, and they do not go through handle_state_normal, which adds them to the topology and token_metadata. The second part of this series uses the endpoint state retrieved in the gossip shadow round to explicitly add the ignored nodes' state to topology (including dc and rack) and token_metadata (tokens) in `prepare_replacement_info`. If there are more DOWN nodes that are not explicitly ignored, replace will fail (as it should).
Fixes #15787
Closes #15788

* github.com:scylladb/scylladb:
  storage_service: join_token_ring: load ignored nodes state if replacing
  storage_service: replacement_info: return ignore_nodes state
  locator: host_id_or_endpoint: keep value as variant
  gms: endpoint_state: add getters for host_id, dc_rack, and tokens
  storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
  gossiper: add_saved_endpoint: set dc and rack
  gossiper: add_saved_endpoint: fixup indentation
  gossiper: add_saved_endpoint: make host_id mandatory
  gossiper: add load_endpoint_state
  gossiper: start_gossiping: log local state
When a node bootstraps or replaces a node after a full cluster shutdown and restart, some nodes may be down. Existing nodes in the cluster load the down nodes' TOKENS (and recently, in this series, also DC and RACK) from system.peers and then populate locator::topology and token_metadata accordingly with the down nodes' tokens in storage_service::join_cluster. However, a bootstrapping/replacing node has no persistent knowledge of the down nodes, and it learns about their existence only from gossip. But since the down nodes have an unknown status, they never go through `handle_state_normal` (in gossiper mode) and therefore are not accounted as normal token owners. This is handled by `topology_state_load`, but not with gossip-based node operations. This patch updates the ignored nodes' (for replace) state in topology and token_metadata as if it were loaded from system tables, after calling `prepare_replacement_info` when raft topology changes are disabled, based on the endpoint_state retrieved in the shadow round initiated in prepare_replacement_info. Fixes scylladb#15787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
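A rough Python model of the replace-path logic this patch adds (all names and data shapes here are illustrative, not the actual Scylla API): nodes seen in the shadow round with NORMAL status are registered as usual; explicitly ignored DOWN nodes have their saved state applied as if loaded from system tables; any other node with unknown status fails the replace.

```python
# Toy model of the ignored-nodes handling around prepare_replacement_info.

def prepare_replacement_info(shadow_round_states, ignore_nodes, replaced_node):
    """shadow_round_states: host_id -> endpoint_state dict from the shadow round."""
    topology = {}      # host_id -> (dc, rack)
    token_owners = {}  # host_id -> tokens
    for host_id, st in shadow_round_states.items():
        app = st["application_states"]
        if host_id == replaced_node:
            continue  # the node being replaced is not re-registered
        if app.get("STATUS") == "NORMAL" or host_id in ignore_nodes:
            # Either a live normal node, or a DOWN node we were explicitly
            # told to ignore: take its dc/rack/tokens as-is.
            topology[host_id] = (app.get("DC"), app.get("RACK"))
            token_owners[host_id] = app.get("TOKENS", [])
        else:
            # A DOWN node that was not listed for ignoring: refuse to
            # proceed, as the commit message says replace should fail.
            raise RuntimeError(f"node {host_id} has unknown status; "
                               "replace cannot proceed")
    return topology, token_owners
```

This mirrors the safety argument from the discussion: nothing is guessed about a down node's status; its state is used only when the operator explicitly asked for it to be ignored.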
If the cluster is fully restarted (say, after a power loss) and some nodes do not come up
(for example, after losing their ephemeral storage), we end up in a state where their endpoint_state,
as initialized by gossiper::add_saved_endpoint, contains only a zeroed generation and version, plus the endpoint's HOST_ID.
In particular, the endpoint_state is missing:

- DC/RACK: having no DC/RACK state can potentially confuse NetworkTopologyStrategy, since those nodes will not be populated in locator::topology.
- STATUS: having no STATUS leads to the nodes being removed from gossip after the quarantine period.
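The two symptoms can be illustrated with a tiny Python model of such a partial endpoint_state (state layout and the quarantine rule are simplified stand-ins, not Scylla's actual logic):

```python
# A down node's endpoint_state as loaded before the fix: zeroed
# generation/version and only HOST_ID among the application states.
partial_state = {
    "generation": 0,
    "version": 0,
    "application_states": {"HOST_ID": "00000000-0000-0000-0000-000000000001"},
}

def dc_of(state):
    # NetworkTopologyStrategy needs a DC per node; None means the node
    # never makes it into locator::topology's DC/rack maps.
    return state["application_states"].get("DC")

def evicted_after_quarantine(state):
    # Toy rule: an endpoint that never presents a STATUS is dropped from
    # gossip once the quarantine period expires.
    return "STATUS" not in state["application_states"]
```

With this state, dc_of() yields nothing usable for replica placement, and the quarantine rule eventually forgets the node entirely — the two failure modes listed above.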