[BUG] Scaling down etcd machine pool can cause multiple machines to be deleted unintentionally #42582

Open
jakefhyde opened this issue Aug 30, 2023 · 7 comments
Labels: area/capi, area/capr/rke2, area/capr, area/provisioning-v2, internal, kind/bug, release-note, status/has-dependency, status/release-note-added, team/hostbusters, [zube]: Blocked

Comments

@jakefhyde
Contributor

jakefhyde commented Aug 30, 2023

Rancher Server Setup

  • Rancher version: v2.7-head

Information about the Cluster

  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Rancher provisioned RKE2

Describe the bug

v2prov currently creates all machine deployments with a machine set deletion policy of Oldest. When scaling down a machine deployment, the oldest node is deleted, which could potentially be the init node, i.e. the node that is considered the leader for the purposes of etcd and control plane joining. If multiple machines are created at roughly the same time, the machine that comes first lexicographically will usually become the init node. When the machine set scales down and the oldest node was the init node, the newly elected init node has to restart, because the previous server-url flag, which pointed at the old node, must be removed. During this time the node may become unhealthy, and a CAPI controller copies the status of the v1 Node in the downstream cluster to the CAPI Machine object as the NodeHealthy condition. When the machine set controller reconciles to determine which machines to delete, machines that are already deleting and machines that are unhealthy are sorted with the same priority, and the tie is broken by choosing the lexicographically first machine. This can cause multiple machines to be deleted when machines are named in lexicographical order in accordance with their age and the controller runs while the node is unhealthy.
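To make the failure mode concrete, here is a minimal, self-contained Go sketch. It is not the actual cluster-api source; the machine type, priority values, and machine names below are invented for illustration. It only shows how a machine that is already deleting and a temporarily unhealthy machine can land in the same priority bucket, leaving the scale-down choice to a name-based tie-break:

```go
package main

import (
	"fmt"
	"sort"
)

// machine is a stand-in for a CAPI Machine with only the attributes that
// matter here; it is not the real cluster-api type.
type machine struct {
	name      string
	deleting  bool // DeletionTimestamp already set
	unhealthy bool // NodeHealthy condition is currently false
}

// priority mimics the behaviour described above: machines that are already
// deleting and machines that are unhealthy land in the same (highest)
// priority bucket. The numbers are arbitrary.
func priority(m machine) int {
	if m.deleting || m.unhealthy {
		return 100
	}
	return 0
}

func main() {
	// The old init node is already being deleted by the scale-down; the newly
	// elected init node is briefly unhealthy while it restarts with the new
	// join URL. The name suffixes are arbitrary.
	machines := []machine{
		{name: "pool1-q7xkm", deleting: true},
		{name: "pool1-c2fjp", unhealthy: true},
	}

	// Highest priority first; ties are broken by name, so the choice between a
	// deleting machine and an unhealthy one is decided purely lexicographically.
	sort.Slice(machines, func(i, j int) bool {
		pi, pj := priority(machines[i]), priority(machines[j])
		if pi != pj {
			return pi > pj
		}
		return machines[i].name < machines[j].name
	})

	// A scale-down of 1 deletes machines[0]. Here that is the unhealthy
	// machine, even though the other machine is already going away, so two
	// machines end up deleted for a single scale-down.
	fmt.Printf("scale-down picks %q while %q is already deleting\n",
		machines[0].name, machines[1].name)
}
```

Whenever that tie-break lands on the unhealthy machine, it is deleted in addition to the machine that is already going away, which is the behaviour described above.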

To Reproduce

For a single node cluster, scale to 2 nodes then back to 1.
For a three node cluster, scale to 3 nodes then back to 2.

Repeat as many times as needed until the issue reproduces. While it is theoretically possible for more than 2 nodes to be deleted, I found it to be extremely unlikely.

Result

Multiple nodes are removed.

Expected Result

1 node is removed.

Screenshots

Additional context

This issue was discovered during testing of harvester/harvester#4358 and was determined to be the root cause.

This is a bug in CAPI and is currently being tracked here: kubernetes-sigs/cluster-api#9334

Release Note for v2.7.6 / later releases as needed
There is a known issue, caused by an upstream cluster-api bug, with etcd node scale-down operations on K3s/RKE2 machine-provisioned clusters. It is possible for the cluster-api core controllers to delete more than the desired number of etcd nodes when reconciling an RKE machine pool (see kubernetes-sigs/cluster-api#9334 for the upstream issue). As such, it is not recommended to scale down etcd nodes, as this may inadvertently delete all etcd nodes in the pool. As always, we recommend that you have a robust backup strategy and store your etcd snapshots in a safe location.
(SURE-7042)

jakefhyde added the kind/bug, area/provisioning-v2, area/capr/rke2, team/hostbusters, area/capi, and area/capr labels on Aug 30, 2023
@slickwarren
Contributor

Looked over the CAPI code and have some follow up questions/points:

  • I see that the random delete policy would fix this issue. However, there is an edge case where new nodes (create timestamp <= 0) could be deleted.
  • Is there a typical timeline in which we could expect upstream to fix the issue for the oldest delete policy, rather than implementing a workaround in Rancher?
  • If we don't wait on CAPI to fix it, is there any reason, outside of the edge case listed above, that Rancher would care about a not-last delete in machine pools?
  • Will customers / their existing workflows or external automation be affected? I think some are depending on the last node being removed when scaling, but this customer may only be on RKE1. However, other customers may be in a similar predicament.
  • If we switch to the random delete policy and CAPI later fixes the oldest delete policy, will we switch back to oldest or stay on random?

@sennerholm

sennerholm commented Sep 28, 2023

It seems that a workaround could be to change the policy to "random" instead of "oldest". In our case we could probably do that with a Kyverno policy, but will the Rancher controller try to change it back? I.e., can we use that as a workaround?

@jakefhyde
Contributor Author

@sennerholm Unfortunately that is not a viable workaround; as you suspect, Rancher will set it back.

@sennerholm

Ok, we as customers asked for possible workarounds in a ticket to Rancher support some days ago. It would be good to have some kind of workaround in place; the last time I checked the upstream ticket there was no update on it.

@Oats87
Contributor

Oats87 commented Oct 25, 2023

Ok, we as customers asked for possible workarounds in a ticket to Rancher support some days ago. It would be good to have some kind of workaround in place; the last time I checked the upstream ticket there was no update on it.

The most effective workaround/mitigation for this issue is to simply create multiple "machine pools" with a quantity of 1, and thus not actually use machine pool scaling. For example, if you need 3 etcd nodes, you would create 3 machine pool entries, 1 for each node (see the sketch below).

cluster-api ordinarily expects the control plane provider (in this case, CAPR) to handle creation and manipulation of the Machine objects, but v2prov/CAPR uses machine deployments for this and ends up hitting cluster-api bugs as a result.
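As a rough sketch of the shape of that workaround (the struct and helper below are invented stand-ins, not the real provisioning.cattle.io/v1 machine pool types), the idea is to expand the desired etcd count into N pool entries of quantity 1 rather than a single pool of quantity N:

```go
package main

import "fmt"

// machinePool is a stand-in for a v2prov machine pool entry; it is not the
// real provisioning.cattle.io/v1 type, just enough to show the layout.
type machinePool struct {
	Name     string
	Quantity int
	EtcdRole bool
}

// etcdPools expands a desired etcd node count into one pool per node with
// quantity 1, mirroring the workaround described above.
func etcdPools(desired int) []machinePool {
	pools := make([]machinePool, 0, desired)
	for i := 1; i <= desired; i++ {
		pools = append(pools, machinePool{
			Name:     fmt.Sprintf("etcd-%d", i),
			Quantity: 1,
			EtcdRole: true,
		})
	}
	return pools
}

func main() {
	// Three etcd nodes -> three machine pool entries of quantity 1.
	for _, p := range etcdPools(3) {
		fmt.Printf("%+v\n", p)
	}
}
```

Scaling is then done by adding or removing whole pool entries, so the machine set scale-down path that triggers this bug is never exercised.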

@snasovich
Collaborator

Moving to "Blocked" as the proper fix will be addressing kubernetes-sigs/cluster-api#9334 - luckily, it looks like there has been some movement on it quite recently, so there is a chance we may get a fix in CAPI. Also updating the milestone to circle back on this for the next minor release, as it will require bumping CAPI.
@jakefhyde @Oats87 , please feel free to correct / chime in with more info.

@snasovich
Collaborator

Should be unblocked once #45090 is done.
