Support for automatic replacement of unreachable nodes in RKE2 machine pool #35275
Comments
Root cause: new feature
What was fixed, or what changes have occurred: Health checks were added to support self-healing node pools.
Areas or cases that should be tested:
What areas could experience regressions?
Are the repro steps accurate/minimal? See this comment for test cases: #35916 (comment)
Test Environment:
Rancher version: v2.6-head e55a04c
Downstream cluster type: RKE2 AWS Node driver
Testing:
Tested this issue with the following scenarios:
Result: Still seem to be getting the same issue.
k8s v1.21 - Low Pass
k8s v1.22 - Low Pass
The results are the same as for v1.21: the cluster is stuck in the "Provisioning" state, and all other scenarios pass.
@timhaneunsoo I tested this and it is working as expected.
Node never goes active and never gets a node ref
Node goes active and then inactive inside the node startup timeout
Node goes inactive outside of the startup timeout
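The three scenarios above can be read as one decision rule. The following is a minimal sketch of that rule, not Rancher's actual implementation: the `Machine` class, the `needs_replacement` function, and both timeout parameters are hypothetical names chosen for illustration. It assumes a startup timeout that applies only while a machine has no node ref, and a separate unhealthy timeout that applies once a node has registered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Machine:
    """Hypothetical view of a machine as seen by a health check."""
    created_at: datetime
    has_node_ref: bool
    unhealthy_since: Optional[datetime]  # None while the node is active


def needs_replacement(machine: Machine, now: datetime,
                      startup_timeout: timedelta,
                      unhealthy_timeout: timedelta) -> bool:
    """Return True if the machine should be replaced (illustrative rule)."""
    if not machine.has_node_ref:
        # Scenario 1: the node never went active and never got a node ref.
        return now - machine.created_at > startup_timeout
    if machine.unhealthy_since is None:
        # The node is registered and currently active.
        return False
    # Scenarios 2 and 3: the node was active and then went inactive.
    return now - machine.unhealthy_since > unhealthy_timeout
```

Under these assumptions, a node that goes inactive is judged only by how long it has been inactive, regardless of whether that happened inside or outside the startup window.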
After further review, I found that I had not waited for the cluster to become active before shutting down a node, so the health checks could not take effect, which left the cluster stuck in "Provisioning". All scenarios are now working as expected.
A new field should be supported on the machine pool definition to specify the threshold for how long a node can stay unreachable before it is replaced. TBD whether the option should be added to the provisioning.cattle.io.clusters -> spec.machinePools[i] object or somewhere else.
Then, logic that uses this new field should be added to the V2 provisioning engine. This should be supported on the CAPI (Cluster API) side per kubernetes-sigs/cluster-api#1990.
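To make the proposal concrete, here is a rough sketch of how such a threshold might be read from a machine pool definition. The issue leaves the field name and its location undecided, so `unreachableNodeTimeout`, the duration format (`300s`, `5m`, `1h`), and both helper functions are assumptions made for illustration only.

```python
import re
from datetime import timedelta
from typing import Optional

# Suffixes accepted by this sketch; the real field format is not specified.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse a simple duration such as '300s', '5m', or '1h'."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def pool_threshold(cluster: dict, index: int) -> Optional[timedelta]:
    """Return the unreachable threshold for spec.machinePools[index].

    Returns None when the hypothetical field is absent, which this
    sketch treats as "automatic replacement disabled".
    """
    pool = cluster["spec"]["machinePools"][index]
    raw = pool.get("unreachableNodeTimeout")  # hypothetical field name
    return parse_duration(raw) if raw else None
```

Treating a missing field as "disabled" keeps existing clusters unchanged until the option is set explicitly; that default is a design assumption of this sketch, not something the issue specifies.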