
Support for automatic replacement of unreachable nodes in RKE2 machine pool #35275

Closed
snasovich opened this issue Oct 26, 2021 · 6 comments

@snasovich
Collaborator

snasovich commented Oct 26, 2021

A new field should be supported on the machine pool definition to specify the threshold for how long a node can stay unreachable before it is replaced. It is TBD whether the option should be added to the provisioning.cattle.io.clusters -> spec.machinePools[i] object or somewhere else.

Then, support for logic utilizing this new field should be added to the V2 provisioning engine. This should be supported on the CAPI (Cluster API) side per kubernetes-sigs/cluster-api#1990.
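For context, the CAPI primitive referenced above is the MachineHealthCheck resource. A minimal sketch of what one might look like is below; the names, namespace, selector label, and timeout values are illustrative assumptions, not taken from this issue.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-pool1-mhc        # illustrative name
  namespace: fleet-default       # assumed Rancher provisioning namespace
spec:
  clusterName: example-cluster
  # replacement stops once more than this many machines are unhealthy
  maxUnhealthy: 40%
  # how long a new machine may take to get a node ref before being considered failed
  nodeStartupTimeout: 15m
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: example-pool1
  # conditions that mark an existing node unhealthy
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
```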

@snasovich snasovich added area/provisioning-v2 Provisioning issues that are specific to the provisioningv2 generating framework area/capr/rke2 RKE2 Provisioning issues involving CAPR labels Oct 26, 2021
@snasovich snasovich added this to the v2.6.3 milestone Oct 26, 2021
@snasovich snasovich changed the title from "Support for automatic replacement of unhealthy nodes in RKE2 machine pool" to "Support for automatic replacement of unreachable nodes in RKE2 machine pool" Oct 26, 2021
@snasovich snasovich modified the milestones: v2.6.3, v2.6.4 Nov 9, 2021
@deniseschannon deniseschannon added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Nov 23, 2021
@deniseschannon deniseschannon modified the milestones: v2.6.4, v2.6.4 - Triaged Dec 1, 2021
@snasovich snasovich assigned paynejacob and unassigned thedadams Dec 13, 2021
@paynejacob
Contributor

Root cause

New feature.

What was fixed, or what changes have occurred

Health checks were added to support self-healing node pools.

Areas or cases that should be tested

  • node failures
  • upgrades
  • scaling up and down

What areas could experience regressions?

  • upgrades
  • scaling up and down

Are the repro steps accurate/minimal?

See this comment for test cases: #35916 (comment)

@paynejacob paynejacob added release-note Note this issue in the milestone's release notes [zube]: To Test and removed [zube]: Review labels Jan 21, 2022
@timhaneunsoo

Upon further testing, the following was found:

maxUnhealthy is set to intstr.IntOrString, but the required value is an int or a percentage.

denied the request: MachineHealthCheck.cluster.x-k8s.io "timrke2122-timrke2cp" is invalid: spec.maxUnhealthy: Invalid value: intstr.IntOrString{Type:1, IntVal:0, StrVal:"3"}: must be either an int or a percentage: invalid value for IntOrString: invalid type: string is not a percentage

When passing an integer into the YAML, the user is stopped by the following error:
Screen Shot 2022-02-17 at 1.24.33 PM.png

Unable to test other scenarios due to the maxUnhealthy issue.
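For reference, the validation quoted above accepts maxUnhealthy either as a plain integer or as a percentage string; a bare number quoted as a string (e.g. "3") is parsed as a string and then rejected because it is not a percentage. Illustrative values (hypothetical, not from a tested config):

```yaml
# Accepted: an integer count of machines
maxUnhealthy: 3

# Accepted: a percentage of machines in the pool
maxUnhealthy: "40%"

# Rejected: a quoted bare number is treated as a string, and a string must be a percentage
# maxUnhealthy: "3"
```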

@timhaneunsoo

timhaneunsoo commented Feb 22, 2022

Test Environment:

Rancher version: v2.6-head e55a04c
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: RKE2 AWS Node driver


Testing:

Tested this issue with the following scenarios:

  • upgraded k8s version
  • forced node failure inside nodeStartupTimeout: node was replaced after nodeStartupTimeout
  • forced node failure outside of nodeStartupTimeout: node was replaced within 1 to 2 UnhealthyNodeTimeout periods
  • forced multiple node failures exceeding MaxUnhealthy: nodes were marked failed but not replaced. After manually deleting enough nodes to go below MaxUnhealthy, the remaining nodes were automatically replaced.

Result

Still seems to be hitting the same issue.

k8s v1.21 - Low Pass

  • Cluster changes state to "Provisioning" but does not appear to provision another node. The cluster is stuck in this state and nothing happens. - Fail
    Screen Shot 2022-02-22 at 3.26.00 PM.png

  • forced node failure outside of nodeStartupTimeout: node was replaced within 1 to 2 UnhealthyNodeTimeout periods - Pass

  • forced multiple node failures exceeding MaxUnhealthy: nodes were marked failed but not replaced. After manually deleting enough nodes to go below MaxUnhealthy, the remaining nodes were automatically replaced but got stuck in provisioning (see image below). - Low Pass
    Screen Shot 2022-02-22 at 6.58.07 PM.png

k8s v1.22 - Low Pass

The results are the same as v1.21: the cluster is stuck in the "Provisioning" state, and all other scenarios pass.

@paynejacob
Contributor

@timhaneunsoo I tested this and it is working as expected.

node never goes active and never gets node ref

  1. create a cluster with a node pool with "nodeStartupTimeout": "15m" and "unhealthyNodeTimeout": "30s".
  2. before the node is able to get a node ref, shut it down outside of Rancher
  3. wait ~20m (up to 30m); the node is deleted and replaced

node goes active and then inactive inside node startup timeout

  1. create a cluster with a node pool with "nodeStartupTimeout": "15m" and "unhealthyNodeTimeout": "30s".
  2. wait for the node to go to a ready state (for me this was ~8m)
  3. shut down the node outside of Rancher
  4. the node is replaced within 1m of going unavailable

node goes inactive outside of startup timeout

  1. create a cluster with a node pool with "nodeStartupTimeout": "15m" and "unhealthyNodeTimeout": "30s".
  2. wait for 20m and ensure the node is in an unready state.
  3. shut down the node outside of Rancher
  4. the node is replaced within 1m of going unavailable
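For reference, a rough sketch of the provisioning.cattle.io cluster spec used in steps like these, assuming the health-check fields sit on the machine pool as quoted above. The cluster name, pool name, Kubernetes version, and machineConfigRef are illustrative, and the exact field placement may differ by Rancher version.

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: example-cluster              # illustrative name
  namespace: fleet-default
spec:
  kubernetesVersion: v1.21.9+rke2r1  # any supported RKE2 version
  rkeConfig:
    machinePools:
      - name: pool1
        quantity: 1
        etcdRole: true
        controlPlaneRole: true
        workerRole: true
        # health-check settings exercised in the steps above
        nodeStartupTimeout: 15m
        unhealthyNodeTimeout: 30s
        machineConfigRef:            # assumed AWS node driver config, per the test environment
          kind: Amazonec2Config
          name: example-pool1-config
```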

@timhaneunsoo

After further review, I found that I did not wait for the cluster to be active before shutting down a node, so the health checks could not take effect, resulting in the cluster being stuck in "Provisioning". All scenarios are now working as expected.
