"Drain Before Delete" support for RKE2 machine pools #35274

Closed
snasovich opened this issue Oct 26, 2021 · 6 comments
Labels

- area/capr/rke2: RKE2 Provisioning issues involving CAPR
- area/provisioning-v2: Provisioning issues that are specific to the provisioning-v2 generating framework
- QA/S
- release-note: Note this issue in the milestone's release notes
- team/hostbusters: The team that is responsible for provisioning/managing downstream clusters + K8s version support

snasovich commented Oct 26, 2021

A new flag should be supported on the machine pool definition to designate whether nodes should be drained before they are deleted. It is TBD whether the option should be added to the provisioning.cattle.io.clusters -> spec.machinePools[i] object or somewhere else.

The (de)provisioning flow should then be updated to use the value of this flag to drain nodes before deleting them where necessary.
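
A minimal sketch of how such a flag might look (illustrative only; the issue leaves the exact field name and placement TBD, and machine pools sit under spec.rkeConfig.machinePools in the provisioning.cattle.io/v1 schema):

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: my-rke2-cluster        # illustrative name
  namespace: fleet-default
spec:
  rkeConfig:
    machinePools:
      - name: worker-pool
        quantity: 3
        workerRole: true
        # Hypothetical flag per this issue: drain the nodes in this
        # pool before their machines are deleted.
        drainBeforeDelete: true
```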

@snasovich snasovich added area/provisioning-v2 Provisioning issues that are specific to the provisioningv2 generating framework area/capr/rke2 RKE2 Provisioning issues involving CAPR labels Oct 26, 2021
@snasovich snasovich added this to the v2.6.3 milestone Oct 26, 2021
@snasovich snasovich modified the milestones: v2.6.3, v2.6.4 Nov 9, 2021
@deniseschannon deniseschannon added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Nov 23, 2021
@deniseschannon deniseschannon modified the milestones: v2.6.4, v2.6.4 - Triaged Dec 1, 2021
@snasovich snasovich assigned jakefhyde and unassigned thedadams Dec 17, 2021
jakefhyde commented Jan 10, 2022

Root cause

N/A

What was fixed, or what changes have occurred

Added backend support for "Drain Before Delete" in RKE2 for parity with RKE1. Until a corresponding UI change lands, the default behavior matches RKE1, where "Drain Before Delete" defaults to false.

Areas or cases that should be tested

Node pools that should not be drained will have the CAPI annotation machine.cluster.x-k8s.io/exclude-node-draining: "true" on their machines, and node pools that should be drained will have no such annotation. The flag can be set explicitly in the YAML spec by first configuring a cluster through the UI and then, in YAML edit mode, setting DrainBeforeDelete: true under the RKEMachinePool spec (see the sketch below).
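
For illustration, a Machine in a pool that should not be drained would carry the annotation like this (a sketch; the machine name is invented, and the apiVersion may vary with the CAPI release in use):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1   # may differ by CAPI version
kind: Machine
metadata:
  name: worker-pool-abc123             # illustrative machine name
  annotations:
    # Present only when the pool's "Drain Before Delete" is disabled;
    # machines in pools that should be drained omit this annotation.
    machine.cluster.x-k8s.io/exclude-node-draining: "true"
```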

What areas could experience regressions?

N/A

Are the repro steps accurate/minimal?

N/A

jakefhyde commented

This should no longer be blocking rancher/dashboard#4448

Auston-Ivison-Suse commented Jan 31, 2022

Feature Testing

Setup for validation
Rancher version: v2.6-head (7efbeab)
Installation: Helm HA

Downstream cluster
Provisioned through: RKE2 on EC2 instances; 3 worker, 3 etcd, and 2 control plane nodes
Kubernetes: v1.21.9

Use case: someone has an already-provisioned RKE2 cluster and wants to enable "Drain Before Delete" on their nodes.

Steps

  1. Edit the config of the RKE2 cluster (this cluster should not yet have Drain Before Delete enabled).
  2. Check the Drain Before Delete box in the config of each node pool.
  3. Press Save (you might have to refresh the page for this to work). The flag should then appear in the cluster's YAML, as sketched below.
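
To confirm the save took effect, the cluster's YAML (Edit as YAML) should show the flag on each edited pool. A sketch, assuming the DrainBeforeDelete field from the earlier comment serializes in camelCase:

```yaml
machinePools:
  - name: pool1               # illustrative pool name
    workerRole: true
    quantity: 3
    drainBeforeDelete: true   # added by the checkbox in step 2 (field casing assumed)
```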

Result
The cluster takes 2-3 to bring the nodes back up. When the nodes do come back up, the cloud-controller-manager pods on the etcd nodes are unhealthy.

Logs of Pod:

I0131 17:20:15.293940 1 serving.go:354] Generated self-signed cert in-memory
W0131 17:20:15.530382 1 authentication.go:419] failed to read in-cluster kubeconfig for delegated authentication: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0131 17:20:15.530415 1 authentication.go:316] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0131 17:20:15.530426 1 authentication.go:340] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0131 17:20:15.530441 1 authorization.go:225] failed to read in-cluster kubeconfig for delegated authorization: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0131 17:20:15.530458 1 authorization.go:193] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0131 17:20:15.543191 1 controllermanager.go:142] Version: v1.22.1-k3s1
I0131 17:20:15.544846 1 secure_serving.go:200] Serving securely on <ip-address>
I0131 17:20:15.546583 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0131 17:20:15.547822 1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...

Reopened this issue on account of the problems seen above.


anupama2501 commented Feb 24, 2022

Verified test cases on v2.6-head 8c785a1

  1. Scale down by 1 worker node; do not drain the node - Failed: "Scale Node Pool Up or Down for RKE2 provisioned clusters is not working for long cluster/node pool names" dashboard#4990 (comment)
  2. Scale down by more than 1 worker node; do not drain the nodes - Failed: dashboard#4990 (comment)
  3. Scale down by 1 worker node; drain the node - Failed: "Drain on delete - Node is stuck in deleting when tried to scale down" #36631
  4. Scale down by more than 1 worker node; drain the nodes - Pass
  5. Scale up worker nodes - Pass
  6. Scale down control plane nodes - Pass
  7. Scale down etcd nodes - Pass
  8. Scale down by clicking on "-" - Pass
  9. Scale down is not available for custom clusters - Pass
  10. Verify by deleting the node pool - Failed: "Delete node pool for rke2 node driver clusters" dashboard#5187
  11. Verify the node status changes from removing to cordoned - Failed: "Nodes states do no change to cordon and removing during scaled down when drain on delete is enabled" dashboard#5220
  12. Verified editing nodes and adding drain on delete, then scale down/scale up - Pass (as mentioned in the comment below, on an edit the control plane and worker nodes are recreated)

anupama2501 commented

@jakefhyde If we edit the cluster and enable drain on delete, the existing control plane and worker nodes are deleted and new nodes are created. This is not the case for etcd nodes. Is this expected?

@snasovich snasovich added the release-note Note this issue in the milestone's release notes label Mar 1, 2022
snasovich commented

Adding release-note as it's definitely a feature we want to call out in the release notes.
We should also consider noting the behavior explained in #35274 (comment), as I believe it's rather unexpected for unsuspecting users.
As for changing this behavior, it's not really an option, since it's how CAPI works, similar to how Deployments and Pods work: if you update a Deployment, its Pods are recreated.
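
By analogy (illustrative only; this is standard Kubernetes behavior rather than anything specific to this issue): any change under a Deployment's pod template triggers a rolling replacement of its Pods, which is the same replace-on-change model CAPI applies to Machines when a machine pool's spec changes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:                    # any change under .spec.template ...
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: app
          image: nginx:1.21    # ... e.g. bumping this image tag ...
# ... makes the Deployment roll out new Pods that replace the old ones.
```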
