
Ability to scale down individual node(s) for RKE2-provisioned clusters #4446

Closed · Tracked by #3346 · Fixed by #5200
snasovich opened this issue Oct 25, 2021 · 13 comments

@snasovich (Contributor) commented Oct 25, 2021

Detailed Description
For RKE1-provisioned clusters, there is currently an option to scale down specific node(s).
As a dropdown action for a single node:
[screenshot]
As an action button for multiple selected nodes:
[screenshot]

The same functionality should be available for RKE2-provisioned clusters.

Context
This is needed for RKE2 provisioning parity with RKE1.

Additional Details
It should be possible to achieve this by setting the cluster.k8s.io/delete-machine annotation on the node(s) to be scaled down, then calling the back-end to update the affected node pool(s) with the appropriate node count for each pool.

Also, it looks like the RKE1 case may allow invalid deletion requests (such as scaling down the only control plane node). It would be nice to avoid such issues in the RKE2 implementation. For example, I managed to break my RKE1 cluster by attempting to scale down the only CP node and then scaling it back up (interestingly, it was stuck at "Waiting for node to be removed from cluster" yet remained operational until I attempted to scale the node pool back up).
[screenshot]

snasovich added this to the v2.6.3 milestone Oct 25, 2021
@gaktive (Member) commented Nov 5, 2021

Per @vincent99, this should be relatively easy: it should be a matter of selecting each of the nodes and setting the annotation on them. This would then allow scaling to happen as expected.

@gaktive (Member) commented Nov 5, 2021

@snasovich we'll push this to 2.6.4 for now, but if you do need this, Vince does have some capacity.

@richard-cox (Member) commented

@snasovich To confirm: which steve/norman resource should the cluster.k8s.io/delete-machine annotation be set on (v1/management.cattle.io.nodes, v3/nodes, etc.)?

I've tried setting the annotation to "true" on cluster.x-k8s.io.machine and then scaling the pool via the normal scaling call on the cluster.x-k8s.io.machinedeployment.

@snasovich (Contributor, Author) commented

@thedadams, could you please help answer Richard's question above?

@thedadams commented

@richard-cox Sorry, there was a typo in Sergey's original message. The annotation does go on the cluster.x-k8s.io.machine object, but the annotation is cluster.x-k8s.io/delete-machine.
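
For reference, here's a minimal kubectl sketch of that flow run against Rancher's management (local) cluster. This is an illustration under stated assumptions, not the dashboard's actual code: the machine and pool names are made up, and fleet-default is assumed as the namespace holding the downstream cluster's CAPI objects.

```sh
# 1. Mark the specific machine for removal. Cluster API treats any
#    non-empty value of this annotation as "prefer deleting this machine
#    when scaling down". (All names below are placeholders.)
kubectl annotate machines.cluster.x-k8s.io -n fleet-default \
  my-cluster-pool1-abc123-xyz45 cluster.x-k8s.io/delete-machine=true

# 2. Scale the owning MachineDeployment down by one; CAPI then removes
#    the annotated machine instead of picking one arbitrarily.
kubectl scale machinedeployments.cluster.x-k8s.io -n fleet-default \
  my-cluster-pool1 --replicas=2
```

Presumably the dashboard issues the equivalent requests through the Steve API; the important part is the ordering: annotate first, then drop the replica count.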

@Auston-Ivison-Suse commented Mar 1, 2022

Further Testing
Rancher setup: v2.6-head (e93a53c)
Downstream cluster: RKE2 on EC2, k8s v1.21.9+rke2r1

Previously Failed Test Cases:

  • In an RKE2 cluster with 1 etcd, 1 control plane, and 2 worker nodes, scale down the etcd nodes. The user should NOT be able to scale down etcd nodes (note the behavior when there is only 1 etcd node). This now passes.

Further testing
Why are we given the option to delete a single node from the kebab menu next to an individual node?

Doing this and then attempting to scale back up leaves no node reference when bringing up another node.

To Repeat This Issue

  1. Navigate to Cluster Management and open the cluster's machine pools
  2. Click the kebab menu on a single-node cluster component (e.g. etcd)
  3. Press Delete

Relevant Screenshots
[screenshot: KebabMenu]
[screenshot: MachinePools]

@Auston-Ivison-Suse commented
@richard-cox do you think the last comment is an issue?

@richard-cox (Member) commented

> @richard-cox do you think the last comment is an issue?

This should be fine. From my understanding, a deleted machine comes back (the pool replaces it), so deletion may be helpful if that instance is misbehaving, whereas a scaled-down machine will never come back; that removal is permanent.
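
To make the delete-vs-scale-down distinction concrete, a hypothetical sketch (same placeholder names and namespace assumption as the earlier snippet):

```sh
# "Delete" from the kebab menu: the MachineDeployment's replica count is
# unchanged, so CAPI provisions a replacement machine automatically,
# which is useful for recycling a misbehaving instance.
kubectl delete machines.cluster.x-k8s.io -n fleet-default \
  my-cluster-pool1-abc123-xyz45

# "Scale down": the replica count itself is decremented (via the earlier
# annotate-then-scale flow), so no replacement is ever created.
```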

@Auston-Ivison-Suse commented
Setup For Testing
Rancher setup: v2.6-head (c49139d)
Downstream cluster: RKE2 on EC2, k8s v1.21.9+rke2r1

Failed Test Cases:

  • In an RKE2 cluster with 1 etcd, 1 control plane, and 2 worker nodes, scale down the etcd nodes. The user should NOT be able to scale down etcd nodes (note the behavior when there is only 1 etcd node).

Debugging

So the option to scale down etcd is available even when there is only a single etcd node.

The moment you scale the node down, you get the following error:

Could not scrape join URL from periodic output (exit code: 0, length: 0) for machine auston-rke2-auston-etcd-645f8895b4-w4ppf Over the cluster.

It appears the node still exists in the node driver's machine provider, so it wasn't fully deleted, but within Rancher the deleting node hangs.
You can also edit the config and bring up another etcd node; this appears to remove the etcd node from Rancher, but the deleted etcd node still exists within the machine provider.
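
If it helps when reproducing this, the stuck machine should be observable from the management cluster (a hypothetical check; the namespace is assumed and the output is illustrative):

```sh
# The machine hangs in the Deleting phase rather than going away.
kubectl get machines.cluster.x-k8s.io -n fleet-default
# NAME                                       PHASE
# auston-rke2-auston-etcd-645f8895b4-w4ppf   Deleting
```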

Screenshots

[screenshot: DeletingEtcd.png]

[screenshot: DeletingEtcdError.png]

@Auston-Ivison-Suse commented

Moving to Done, seeing as @richard-cox confirms the expected behavior was seen in my testing and the previously failed test case now passes.

@jtravee commented Mar 16, 2022

Confirmed with @catherineluse and @gaktive to add the release note label.
