"Drain Before Delete" support for RKE2 machine pools #35274

Closed
snasovich opened this issue Oct 26, 2021 · 6 comments
Labels

- area/capr/rke2: RKE2 Provisioning issues involving CAPR
- area/provisioning-v2: Provisioning issues that are specific to the provisioning-v2 generating framework
- QA/S
- release-note: Note this issue in the milestone's release notes
- team/hostbusters: The team that is responsible for provisioning/managing downstream clusters + K8s version support

snasovich commented Oct 26, 2021

A new flag should be supported on the machine pool definition to designate whether nodes should be drained before they are deleted. It is TBD whether the option should be added to the provisioning.cattle.io.clusters -> spec.machinePools[i] object or somewhere else.

The (de)provisioning flow should then be updated to use the value of this flag to drain nodes before deleting them where necessary.
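
A minimal sketch of how such a flag might look (illustrative only; the issue leaves the exact field name and placement TBD, and machine pools sit under spec.rkeConfig.machinePools in the provisioning.cattle.io/v1 schema):

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: my-rke2-cluster        # illustrative name
  namespace: fleet-default
spec:
  rkeConfig:
    machinePools:
      - name: worker-pool
        quantity: 3
        workerRole: true
        # Hypothetical flag per this issue: drain the nodes in this
        # pool before their machines are deleted.
        drainBeforeDelete: true
```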

@snasovich snasovich added area/provisioning-v2 Provisioning issues that are specific to the provisioningv2 generating framework area/capr/rke2 RKE2 Provisioning issues involving CAPR labels Oct 26, 2021
@snasovich snasovich added this to the v2.6.3 milestone Oct 26, 2021
@snasovich snasovich modified the milestones: v2.6.3, v2.6.4 Nov 9, 2021
@deniseschannon deniseschannon added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Nov 23, 2021
@deniseschannon deniseschannon modified the milestones: v2.6.4, v2.6.4 - Triaged Dec 1, 2021
@snasovich snasovich assigned jakefhyde and unassigned thedadams Dec 17, 2021
jakefhyde commented Jan 10, 2022

Root cause

N/A

What was fixed, or what changes have occurred

Added backend support for "Drain Before Delete" in RKE2 for parity with RKE1. Until a corresponding UI change lands, the default behavior matches RKE1, where "Drain Before Delete" defaults to false.

Areas or cases that should be tested

Node pools that should not be drained will have the CAPI annotation machine.cluster.x-k8s.io/exclude-node-draining: "true" on their machines, and node pools that should be drained will have no such annotation. The flag can be set explicitly in the YAML spec by first configuring a cluster through the UI and then, in YAML edit mode, setting DrainBeforeDelete: true under the RKEMachinePool spec (see the sketch below).
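
For illustration, a Machine in a pool that should not be drained would carry the annotation like this (a sketch; the machine name is invented, and the apiVersion may vary with the CAPI release in use):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1   # may differ by CAPI version
kind: Machine
metadata:
  name: worker-pool-abc123             # illustrative machine name
  annotations:
    # Present only when the pool's "Drain Before Delete" is disabled;
    # machines in pools that should be drained omit this annotation.
    machine.cluster.x-k8s.io/exclude-node-draining: "true"
```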

What areas could experience regressions?

N/A

Are the repro steps accurate/minimal?

N/A

jakefhyde commented

This should no longer be blocking rancher/dashboard#4448

Auston-Ivison-Suse commented Jan 31, 2022

Feature Testing

Setup for validation
Rancher version: v2.6-head (7efbeab)
Installation: Helm HA

Downstream cluster
Provisioned through: RKE2 on EC2 instances; 3 worker, 3 etcd, and 2 control plane nodes
Kubernetes: v1.21.9

Use case: someone has an already-provisioned RKE2 cluster and wants to enable "Drain Before Delete" on their nodes.

Steps

  1. Edit the config of the RKE2 cluster (this cluster should not yet have Drain Before Delete enabled).
  2. Check the Drain Before Delete box in the config of each node pool.
  3. Press Save (you might have to refresh the page for this to work). The flag should then appear in the cluster's YAML, as sketched below.
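
To confirm the save took effect, the cluster's YAML (Edit as YAML) should show the flag on each edited pool. A sketch, assuming the DrainBeforeDelete field from the earlier comment serializes in camelCase:

```yaml
machinePools:
  - name: pool1               # illustrative pool name
    workerRole: true
    quantity: 3
    drainBeforeDelete: true   # added by the checkbox in step 2 (field casing assumed)
```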

Result
The cluster takes 2-3 to bring the nodes back up. When the nodes do come back up, the cloud-controller-manager pods on the etcd nodes are unhealthy.

Logs of Pod:

I0131 17:20:15.293940 1 serving.go:354] Generated self-signed cert in-memory
W0131 17:20:15.530382 1 authentication.go:419] failed to read in-cluster kubeconfig for delegated authentication: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0131 17:20:15.530415 1 authentication.go:316] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0131 17:20:15.530426 1 authentication.go:340] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0131 17:20:15.530441 1 authorization.go:225] failed to read in-cluster kubeconfig for delegated authorization: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0131 17:20:15.530458 1 authorization.go:193] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0131 17:20:15.543191 1 controllermanager.go:142] Version: v1.22.1-k3s1
I0131 17:20:15.544846 1 secure_serving.go:200] Serving securely on <ip-address>
I0131 17:20:15.546583 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0131 17:20:15.547822 1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...

Reopened this issue on account of the problems seen above.


anupama2501 commented Feb 24, 2022

Verified test cases on v2.6-head 8c785a1

  1. Scale down by 1 worker node; do not drain the node - Failed: "Scale Node Pool Up or Down for RKE2 provisioned clusters is not working for long cluster/node pool names" dashboard#4990 (comment)
  2. Scale down by more than 1 worker node; do not drain the nodes - Failed: dashboard#4990 (comment)
  3. Scale down by 1 worker node; drain the node - Failed: "Drain on delete - Node is stuck in deleting when tried to scale down" #36631
  4. Scale down by more than 1 worker node; drain the nodes - Pass
  5. Scale up worker nodes - Pass
  6. Scale down control plane nodes - Pass
  7. Scale down etcd nodes - Pass
  8. Scale down by clicking on "-" - Pass
  9. Scale down is not available for custom clusters - Pass
  10. Verify by deleting the node pool - Failed: "Delete node pool for rke2 node driver clusters" dashboard#5187
  11. Verify the node status changes from removing to cordoned - Failed: "Nodes states do no change to cordon and removing during scaled down when drain on delete is enabled" dashboard#5220
  12. Verified editing nodes and adding drain on delete, then scale down/scale up - Pass (as mentioned in the comment below, on an edit the control plane and worker nodes are recreated)

anupama2501 commented

@jakefhyde If we edit the cluster and enable drain on delete, the existing control plane and worker nodes are deleted and new nodes are created. This is not the case for etcd nodes. Is this expected?

@snasovich snasovich added the release-note Note this issue in the milestone's release notes label Mar 1, 2022
snasovich commented

Adding release-note as it's definitely a feature we want to call out in the release notes.
We should also consider noting the behavior explained in #35274 (comment), as I believe it's rather unexpected for unsuspecting users.
As for changing this behavior, it's not really an option, since it's how CAPI works, similar to how Deployments and Pods work: if you update a Deployment, its Pods are recreated.
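
By analogy (illustrative only; this is standard Kubernetes behavior rather than anything specific to this issue): any change under a Deployment's pod template triggers a rolling replacement of its Pods, which is the same replace-on-change model CAPI applies to Machines when a machine pool's spec changes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:                    # any change under .spec.template ...
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: app
          image: nginx:1.21    # ... e.g. bumping this image tag ...
# ... makes the Deployment roll out new Pods that replace the old ones.
```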
