
Verify that the replacing node is in the same DC/RACK as the node being replaced when needed #18139

Open
vladzcloudius opened this issue Apr 1, 2024 · 9 comments
Labels: area/topology changes, enhancement, Field-Tier2 (Important issues as per FieldEngineering request)

vladzcloudius commented Apr 1, 2024

Installation details
HEAD: ce17841

Description

When one replaces a node using the replace_address_xx/replace_node_xxx parameters and by mistake uses a node that will belong to a different rack or even a different DC, Scylla will simply allow that.

As a result, if NetworkTopologyStrategy is used, the ownership is going to change on nodes unrelated to the node being replaced, hence even if RBNO is used the data on those nodes is not going to be streamed to reflect the new ownership.

We will therefore end up with inconsistent data and will need to run a full repair to fix it.

This is obviously not a desired behavior and such mistakes should be prevented.

We need a "safeguard" that would fail such an attempt.

@vladzcloudius

One way to implement such a safeguard is:

On the replacing node, after it fetches the application state of the node being replaced and the schema from the cluster:

validate_topology_params = false;
For each keyspace KS {
   if (KS replication strategy is `NetworkTopologyStrategy`) {
       validate_topology_params = true;
       break;
   }
}

if (validate_topology_params) {
    verify that this node's DC and RACK are the same as the ones of the node being replaced
    if they are not - exit with the corresponding error message
}
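For illustration, the steps above can be sketched in Python (all names here are hypothetical; an actual implementation would live in Scylla's C++ boot path):

```python
from dataclasses import dataclass

@dataclass
class Keyspace:
    name: str
    replication_strategy: str

def needs_topology_validation(keyspaces):
    """True if any keyspace uses NetworkTopologyStrategy."""
    return any(ks.replication_strategy == "NetworkTopologyStrategy"
               for ks in keyspaces)

def check_replace_topology(local_dc, local_rack,
                           replaced_dc, replaced_rack, keyspaces):
    """Fail the replace attempt when the replacing node's DC/rack
    differ from those of the node being replaced."""
    if not needs_topology_validation(keyspaces):
        return
    if (local_dc, local_rack) != (replaced_dc, replaced_rack):
        raise RuntimeError(
            f"cannot replace a node in {replaced_dc}/{replaced_rack} "
            f"with a node in {local_dc}/{local_rack}: "
            "DC and rack must match")
```

The check is deliberately conservative: it only engages when at least one keyspace uses NetworkTopologyStrategy, and otherwise lets the replace proceed unchanged.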

@vladzcloudius vladzcloudius added the Field-Tier2 Important issues as per FieldEngineering request label Apr 1, 2024
@vladzcloudius

cc @mkeeneyj


mykaul commented Apr 2, 2024

I kinda remember @bhalevy raising a similar concern.


mykaul commented Apr 2, 2024

I kinda remember @bhalevy raising a similar concern.

#15787


bhalevy commented Apr 2, 2024

#16858 more accurately.


bhalevy commented Apr 2, 2024

With tablets, technically the new node only has to be in the same DC to comply with the network topology replication strategy configuration, but it had better be on the same rack as well, so as not to screw up load balancing across racks.

To keep it simple, we should require keeping both DC and rack on replace; if there is a need for renaming, it should be done separately, using a different mechanism like a snitch change.
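The trade-off described here can be sketched as two validation policies (a hypothetical illustration, not Scylla code): a relaxed DC-only check that tablets would technically permit, and the strict DC-and-rack check argued for above:

```python
def replace_allowed(local, replaced, strict=True):
    """local and replaced are (dc, rack) tuples.

    strict=True  -> require both DC and rack to match (the proposal above).
    strict=False -> require only the DC, which is all that
                    NetworkTopologyStrategy placement technically needs
                    with tablets.
    """
    if local[0] != replaced[0]:
        # A DC mismatch always changes replication placement.
        return False
    if strict and local[1] != replaced[1]:
        # A rack mismatch skews load balancing across racks.
        return False
    return True
```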


vladzcloudius commented Apr 2, 2024

> #16858 more accurately.

The description suggests that this one is a duplicate of the above.
However, I'm probably a bit more explicit in regard to "what I want" (I want a safeguard!).
So it's up to you, guys: feel free to close this one as a duplicate, but we really want this safeguard to be implemented, and it is a rather trivial thing to do that will save lives!

I can add it to an FE Hackathon list for this R&D Summit, maybe...

@vladzcloudius

Wait a second! There is even a patch already: bhalevy@1dc449c

Why wasn't it merged, @bhalevy?


bhalevy commented May 8, 2024

> Wait a second! There is even a patch already: bhalevy@1dc449c
>
> Why wasn't it merged, @bhalevy?

It had a dependency on another issue and that was fixed relatively recently.
See #16862 (comment)
