
[Backport v0.7] Clean up existing ClusterRegistrations on Fleet Upgrade #1692

Closed
rancherbot opened this issue Aug 2, 2023 · 1 comment
rancherbot commented Aug 2, 2023

This is a backport issue for #1690, automatically created via rancherbot by @manno

Original issue description:

This is an extension to #1651.
Should also fix #1674
It needs a backport to 0.7.x.

Implemented by: #1689

Fleet 0.7.0 creates multiple clusterregistration resources and does not clean them up. This adds a helm hook to run a cleanup script when upgrading Fleet.

We assume agents only use the latest clusterregistration and clean up the others. The script does not check whether a registration was granted. It also tries to delete the child resources. If the fleet-controller is running, its cleanup handler would also delete the orphaned resources. The script works across all namespaces.
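
A minimal sketch of that idea, not the actual helm hook script shipped with Fleet: it assumes one registered cluster per namespace (the real cleanup also groups registrations per cluster and removes child resources).

#!/bin/bash
# Hypothetical sketch only: keep the newest clusterregistration per
# namespace and delete the older ones. Not the shipped hook script.
for ns in $(kubectl get ns -o name | cut -d/ -f2); do
  # list registrations oldest first and drop the last (newest) entry
  old=$(kubectl get clusterregistrations.fleet.cattle.io -n "$ns" \
        --sort-by=.metadata.creationTimestamp -o name 2>/dev/null | head -n -1)
  [ -n "$old" ] && kubectl delete -n "$ns" $old
done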

The migration job can be disabled via helm values.
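
For reference, a hedged example of turning the job off via a values override; the value name and chart reference used here are assumptions, check the chart's values.yaml for the actual key:

# The value name below is an assumption; consult the fleet chart's
# values.yaml for the real key that disables the migration job.
helm upgrade fleet fleet/fleet -n cattle-fleet-system --reuse-values \
  --set migrations.clusterRegistrationCleanup=false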

Testing

  • install a rancher/fleet version which does not have the automatic cleanup after registration, e.g. 2.7.5
  • create a situation where there are multiple outdated clusterregistrations, e.g. by forcing agent redeployments a few times:
#!/bin/bash

ns=${1:-fleet-local}
name=${2:-local}
kubectl patch clusters.fleet.cattle.io -n "$ns" "$name" --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": '$RANDOM'}]'
  • try to have some outdated registrations for clusters that have been deleted, e.g. by creating lots of registrations, stopping the fleet-controller and deleting the clusters.fleet.cattle.io resources (or the whole cluster in Rancher?) manually
  • upgrade to a fleet version with the cleanup upgrade job and check that all outdated clusterregistrations are removed (see the sketch after this list)
  • check that existing agents are still registered and can connect to the upstream API server, e.g. by deploying a new bundle
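
A quick spot check for that cleanup (a sketch; the expected count depends on how many clusters are registered):

# Count clusterregistrations across all namespaces; after the upgrade job
# has run, this should drop to roughly one per registered cluster.
kubectl get clusterregistrations.fleet.cattle.io -A --no-headers | wc -l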

Engineering Testing

Manual Testing

Upgraded fleet standalone multiple times and watched the job spawn. Checked with helm template that the new values work.
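
A hedged example of that helm template check, assuming a local chart checkout and the same hypothetical value name as above:

# Render the chart with the job disabled; the cleanup job manifest should
# then be absent from the output (value name and chart path are assumptions).
helm template fleet charts/fleet \
  --set migrations.clusterRegistrationCleanup=false | grep -i clusterregistration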

QA Testing Considerations

The cleanup script might use a lot of resources and run for a long time if cleaning up lots of (20k+) resources.
It should be fine for smaller fleets (<20 clusters).

Regressions Considerations

Some fleets might have too many resources for an automatic cleanup to be effective?

@sbulage

sbulage commented Aug 16, 2023

The following points were checked:

  • Cleanup of old cluster registration resources while upgrading Rancher/Fleet in the cluster.
  • QA template from the description was followed.

I used the script given by @manno to create 100+ clusterregistrations in the cluster.
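
A sketch of how that repetition could be driven, assuming the patch script from the issue description is saved as redeploy.sh (script name, count, and cluster name are illustrative):

# Call the redeploy patch script from the issue description in a loop to
# pile up outdated clusterregistrations (names are illustrative).
for i in $(seq 1 100); do
  ./redeploy.sh fleet-default imported-cluster-1
  sleep 5   # give the agent time to create a fresh registration
done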

In order to reproduce the issue, the following steps were performed.

  • I kept the cluster running for several days (with a long weekend in between) and observed the current cluster resources and cluster registrations.
  • Upgrade performed from Rancher 2.7.5 to Rancher 2.7.6-rc2.

Observations

  • Before Upgrade
    Using the Released version of Rancher and fleet as mentioned below:
    Rancher: v2.7.5
    Fleet: v0.7.0
    
    • In between those days, added 4 GitRepos to the cluster.
    • Observed the cluster registrations before patching the clusterregistrations.
    • Observed that the Roles and RoleBindings increased significantly.
    • Every time I executed the above patch command, it created new cluster registrations without deleting the old ones.
    • In my setup the clusterregistrations increased from 4 to 414.
    • Other resources created by the clusterregistrations also increased.
    • Deleted one of the clusters from clusters.fleet.cattle.io:
      kubectl delete clusters.fleet.cattle.io -n fleet-default imported-cluster-2
      

After observing this situation over several days, I upgraded to the latest Rancher RC and fleet RC versions in which the fix is available.

  • After Upgrade

    Rancher: v2.7.6-rc2
    Fleet: 0.7.1-rc.2
    
    • While the upgrade was happening, observed that the clusterregistrations went down to 4.
    • The cluster that was deleted from clusters.fleet.cattle.io before the upgrade got re-added to fleet.
    • Re-registration of a cluster deploys a new fleet-agent every time, which can be seen in the fleet-controller logs.
    • There was no harm to the existing resources added by the GitRepos while upgrading to Rancher 2.7.6-rc2.
    • After the upgrade, the imported clusters' ClusterSpecs work as expected.
    • Updated the ClusterSpec of the imported clusters.
    • After every ClusterSpec update, I started the fleet-controller and saw the fleet-agent re-created on the imported clusters with the updated spec configuration.
    • Executed the command below to check whether clusterregistrations get deleted:
      kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 2}]'

    • The fleet-controller logs show the deletion of old clusterregistrations and the reference to the new one.

P.S. In the above testing, P0 and regression tests were performed on the cluster after the upgrade.

  • The table below shows the cleanup of resources during the upgrade.

    Resources             Before upgrade   After upgrade
    ClusterRoleBindings   512              110
    Cluster Roles         136              143
    RoleBindings          905              94
    Roles                 482              80
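
The counts in the table can be gathered with simple kubectl queries (a sketch; run the same commands before and after the upgrade to compare):

  # Count the RBAC objects compared in the table above.
  kubectl get clusterrolebindings --no-headers | wc -l
  kubectl get clusterroles --no-headers | wc -l
  kubectl get rolebindings -A --no-headers | wc -l
  kubectl get roles -A --no-headers | wc -l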
