
Support for automatic replacement of unreachable nodes in RKE2 machine pool #35275

Closed
snasovich opened this issue Oct 26, 2021 · 6 comments

@snasovich
Collaborator

snasovich commented Oct 26, 2021

A new field should be supported on the machine pool definition to specify the threshold for how long a node can stay unreachable before it is replaced. It is TBD whether the option should be added to the provisioning.cattle.io.clusters -> spec.machinePools[i] object or somewhere else.

Then, support for logic utilizing this new field should be added to the V2 provisioning engine. This should be supported on the CAPI (Cluster API) side per kubernetes-sigs/cluster-api#1990.
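For context, the CAPI primitive referenced above is the MachineHealthCheck resource. A minimal sketch of what one might look like is below; the names, namespace, selector label, and timeout values are illustrative assumptions, not taken from this issue.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-pool1-mhc        # illustrative name
  namespace: fleet-default       # assumed Rancher provisioning namespace
spec:
  clusterName: example-cluster
  # replacement stops once more than this many machines are unhealthy
  maxUnhealthy: 40%
  # how long a new machine may take to get a node ref before being considered failed
  nodeStartupTimeout: 15m
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: example-pool1
  # conditions that mark an existing node unhealthy
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
```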

@snasovich snasovich added area/provisioning-v2 Provisioning issues that are specific to the provisioningv2 generating framework area/capr/rke2 RKE2 Provisioning issues involving CAPR labels Oct 26, 2021
@snasovich snasovich added this to the v2.6.3 milestone Oct 26, 2021
@snasovich snasovich changed the title from "Support for automatic replacement of unhealthy nodes in RKE2 machine pool" to "Support for automatic replacement of unreachable nodes in RKE2 machine pool" Oct 26, 2021
@snasovich snasovich modified the milestones: v2.6.3, v2.6.4 Nov 9, 2021
@deniseschannon deniseschannon added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Nov 23, 2021
@deniseschannon deniseschannon modified the milestones: v2.6.4, v2.6.4 - Triaged Dec 1, 2021
@snasovich snasovich assigned paynejacob and unassigned thedadams Dec 13, 2021
@paynejacob
Contributor

Root cause

New feature.

What was fixed, or what changes have occurred

Health checks were added to support self-healing node pools.

Areas or cases that should be tested

  • node failures
  • upgrades
  • scaling up and down

What areas could experience regressions?

  • upgrades
  • scaling up and down

Are the repro steps accurate/minimal?

See this comment for test cases: #35916 (comment)

@paynejacob paynejacob added release-note Note this issue in the milestone's release notes [zube]: To Test and removed [zube]: Review labels Jan 21, 2022
@timhaneunsoo

Upon further testing, the following was found:

maxUnhealthy is set to intstr.IntOrString, but the required value is an int or a percentage.

denied the request: MachineHealthCheck.cluster.x-k8s.io "timrke2122-timrke2cp" is invalid: spec.maxUnhealthy: Invalid value: intstr.IntOrString{Type:1, IntVal:0, StrVal:"3"}: must be either an int or a percentage: invalid value for IntOrString: invalid type: string is not a percentage

When passing an integer into the YAML, the user is stopped by the following error:
Screen Shot 2022-02-17 at 1.24.33 PM.png

Unable to test other scenarios due to the maxUnhealthy issue.
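For reference, the validation quoted above accepts maxUnhealthy either as a plain integer or as a percentage string; a bare number quoted as a string (e.g. "3") is parsed as a string and then rejected because it is not a percentage. Illustrative values (hypothetical, not from a tested config):

```yaml
# Accepted: an integer count of machines
maxUnhealthy: 3

# Accepted: a percentage of machines in the pool
maxUnhealthy: "40%"

# Rejected: a quoted bare number is treated as a string, and a string must be a percentage
# maxUnhealthy: "3"
```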

@timhaneunsoo

timhaneunsoo commented Feb 22, 2022

Test Environment:

Rancher version: v2.6-head e55a04c
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: RKE2 AWS Node driver


Testing:

Tested this issue with the following scenarios:

  • upgraded k8s version
  • forced node failure inside nodeStartupTimeout: node was replaced after nodeStartupTimeout
  • forced node failure outside of nodeStartupTimeout: node was replaced within 1 to 2 UnhealthyNodeTimeout periods
  • forced multiple node failures exceeding MaxUnhealthy: nodes were marked failed but not replaced. After manually deleting enough nodes to go below MaxUnhealthy, the remaining nodes were automatically replaced.

Result

Still seems to be hitting the same issue.

k8s v1.21 - Low Pass

  • Cluster changes state to "Provisioning" but does not appear to provision another node. The cluster is stuck in this state and nothing happens. - Fail
    Screen Shot 2022-02-22 at 3.26.00 PM.png

  • forced node failure outside of nodeStartupTimeout: node was replaced within 1 to 2 UnhealthyNodeTimeout periods - Pass

  • forced multiple node failures exceeding MaxUnhealthy: nodes were marked failed but not replaced. After manually deleting enough nodes to go below MaxUnhealthy, the remaining nodes were automatically replaced but got stuck in provisioning (see image below). - Low Pass
    Screen Shot 2022-02-22 at 6.58.07 PM.png

k8s v1.22 - Low Pass

The results are the same as v1.21: the cluster is stuck in the "Provisioning" state, and all other scenarios pass.

@paynejacob
Contributor

@timhaneunsoo I tested this and it is working as expected.

node never goes active and never gets node ref

  1. create a cluster with a node pool with "nodeStartupTimeout": "15m" and "unhealthyNodeTimeout": "30s".
  2. before the node is able to get a node ref, shut it down outside of Rancher
  3. wait ~20m (up to 30m); the node is deleted and replaced

node goes active and then inactive inside node startup timeout

  1. create a cluster with a node pool with "nodeStartupTimeout": "15m" and "unhealthyNodeTimeout": "30s".
  2. wait for the node to go to a ready state (for me this was ~8m)
  3. shut down the node outside of Rancher
  4. the node is replaced within 1m of going unavailable

node goes inactive outside of startup timeout

  1. create a cluster with a node pool with "nodeStartupTimeout": "15m" and "unhealthyNodeTimeout": "30s".
  2. wait for 20m and ensure the node is in an unready state.
  3. shut down the node outside of Rancher
  4. the node is replaced within 1m of going unavailable
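For reference, a rough sketch of the provisioning.cattle.io cluster spec used in steps like these, assuming the health-check fields sit on the machine pool as quoted above. The cluster name, pool name, Kubernetes version, and machineConfigRef are illustrative, and the exact field placement may differ by Rancher version.

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: example-cluster              # illustrative name
  namespace: fleet-default
spec:
  kubernetesVersion: v1.21.9+rke2r1  # any supported RKE2 version
  rkeConfig:
    machinePools:
      - name: pool1
        quantity: 1
        etcdRole: true
        controlPlaneRole: true
        workerRole: true
        # health-check settings exercised in the steps above
        nodeStartupTimeout: 15m
        unhealthyNodeTimeout: 30s
        machineConfigRef:            # assumed AWS node driver config, per the test environment
          kind: Amazonec2Config
          name: example-pool1-config
```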

@timhaneunsoo

After further review, I found that I did not wait for the cluster to be active before shutting down a node, so the health checks could not take effect, resulting in the cluster being stuck in "Provisioning". All scenarios are now working as expected.
