
Rancher reports cluster as active even though the add-ons are not upgraded as part of k8s version upgrade #35750

Closed
SheilaghM opened this issue Dec 4, 2021 · 6 comments
Assignees
Labels
area/provisioning-rke1 (Provisioning issues with RKE1), internal, team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
Milestone

Comments

@SheilaghM

SURE-3541

Rancher Server Setup

  • Rancher version:
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version:
  • Cluster Type (Local/Downstream):
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

Issue description:
Customer uses Rancher API call to determine the cluster upgrade state.

curl -s -k -u "${CATTLE_ACCESS_KEY}:${CATTLE_SECRET_KEY}" \
  -X GET \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  https://RANCHER_FQDN/v3/clusters/c-xxxx | jq .state
However, the state changes to active even before the add-on jobs complete; in particular, critical components such as coredns and the CNI had not finished upgrading. This is not expected behavior, since timed-out add-on jobs leave a partially upgraded cluster. (Affected cluster ID: c-w56h9)
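
A minimal polling sketch along these lines (the RANCHER_FQDN placeholder and cluster ID c-xxxx are illustrative, as in the call above):

# Poll the cluster state via the same /v3/clusters endpoint until it reports active.
while true; do
  state=$(curl -s -k -u "${CATTLE_ACCESS_KEY}:${CATTLE_SECRET_KEY}" \
    -H 'Accept: application/json' \
    https://RANCHER_FQDN/v3/clusters/c-xxxx | jq -r .state)
  echo "cluster state: ${state}"
  [ "${state}" = "active" ] && break
  sleep 30
done

Per this issue, reaching "active" does not currently guarantee that the add-on jobs have completed.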

Business impact:
Partially upgraded cluster.

Troubleshooting steps:
They had a situation where the add-on jobs timed out and none of the add-ons were upgraded.

The old add-on jobs were stuck in the removing state in the UI.
They took the following steps to complete the upgrade:

  • Deleted the add-on jobs that were stuck in the removing state
  • Triggered a reconciliation by changing the add-on job timeout
Repro steps:

Old kubelet logs were not available, so it is not clear how the add-on jobs got stuck in the removing state.
Workaround:
Is a workaround available and implemented? No
What is the workaround:

Actual behavior:
Cluster state changed to "Active" even before the add-on jobs completed.

Expected behavior:
Cluster state should stay in the "Upgrading" state until all add-on jobs are completed.

Files, logs, traces:

Additional notes:
Logs are attached to SURE ticket.

@SheilaghM added the team/hostbusters label on Dec 4, 2021
@snasovich added this to the v2.6.4 - Triaged milestone on Dec 21, 2021
@zube bot removed the [zube]: Next Up label on Dec 29, 2021
@snasovich
Collaborator

snasovich commented Jan 4, 2022

The issue happens when the previous *-deploy-job cannot be deleted for some reason during the upgrade process. A simple way to reproduce it is to manually put a dummy finalizer on some/each of these jobs.

Also, it's important to note that addon deploy jobs are created during the upgrade and NOT cleaned up during the same upgrade, but are kept. Then, when the next upgrade is executed, these jobs are supposed to be deleted prior to creating the ones needed for this new upgrade (with the same name).

So, in the case where jobs cannot get deleted for some reason, the first pass of the upgrade gets stuck when attempting to delete the first of these jobs and eventually moves on to the second upgrade attempt. The RKE log would indicate something like this:

10:24:21 pm | [INFO ] [network] Setting up network plugin: canal
10:24:21 pm | [INFO ] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
10:24:22 pm | [INFO ] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
10:24:22 pm | [INFO ] [addons] Executing deploy job rke-network-plugin
10:26:29 pm | [INFO ] Initiating Kubernetes cluster

At the next upgrade attempt, that same job appears to go through and the process gets stuck on the next one (however, in reality this first job would not run):

10:26:49 pm | [INFO ] [network] Setting up network plugin: canal
10:26:49 pm | [INFO ] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
10:26:50 pm | [INFO ] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
10:26:50 pm | [INFO ] [addons] Executing deploy job rke-network-plugin
10:26:50 pm | [INFO ] [addons] Setting up coredns
10:26:50 pm | [INFO ] [addons] Saving ConfigMap for addon rke-coredns-addon to Kubernetes
10:26:50 pm | [INFO ] [addons] Successfully saved ConfigMap for addon rke-coredns-addon to Kubernetes
10:26:50 pm | [INFO ] [addons] Executing deploy job rke-coredns-addon
10:29:27 pm | [INFO ] Initiating Kubernetes cluster

After enough iterations to go through all the jobs like this, the k8s upgrade succeeds. However, none of the jobs actually complete so addons are not upgraded.
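
For reference, these leftover deploy jobs (none of which actually ran for the new version) can be listed in the downstream cluster; a small sketch, assuming the standard RKE1 *-deploy-job naming:

# List the addon deploy jobs left over from the previous upgrade / initial provisioning:
kubectl get jobs -n kube-system | grep -- -deploy-job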

It looks like the code below doesn't try to delete an existing job that was already processed on a previous run (likely addonUpdated ends up being false). So when the following lines try to create a new job, we hit the "already exists" case and simply proceed, assuming it was properly created before, when in fact we still have a leftover from the previous upgrade / initial provisioning:
https://github.com/rancher/rke/blob/54dc689ac040076881a5b4af50326d5d15efb2c6/k8s/job.go#L39-L51
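
One way to confirm that the "already exists" job is such a leftover (and possibly stuck in deletion) is to inspect its timestamps and finalizers; a sketch using the rke-network-plugin job from the log above:

# An old creationTimestamp and/or a set deletionTimestamp indicate a leftover job stuck in deletion:
kubectl get job rke-network-plugin-deploy-job -n kube-system \
  -o jsonpath='{.metadata.creationTimestamp}{" "}{.metadata.deletionTimestamp}{" "}{.metadata.finalizers}{"\n"}'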

Thank you @kinarashah @jakefhyde for working on investigating this!

@snasovich
Collaborator

/backport v2.5.12

@slickwarren
Contributor

also reproduced this on 2.5.9 with the following steps:

  • deploy a cluster on k8s 1.19.15-rancher1-1
  • on the coredns job after the cluster gets to an active state, edit the job and add a dummy finalizer
    • kubectl edit -n kube-system job rke-coredns-addon-deploy-job
    • add finalizers: with a - dummy/qafinalizer entry under metadata (see the patch equivalent after this list)
    • validate that the finalizer was added
  • upgrade the cluster to k8s 1.20.xx
  • observe that the coredns job fails to upgrade (in system project -> kube-system namespace)
  • observe that in spite of the failed job upgrade, the cluster still gets to an active state
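
The same finalizer can also be added non-interactively; a sketch using the dummy finalizer from the steps above and the coredns job name mentioned later in this thread:

# Add the dummy finalizer to the coredns addon deploy job, then verify it is set:
kubectl patch job rke-coredns-addon-deploy-job -n kube-system \
  --type=merge -p '{"metadata":{"finalizers":["dummy/qafinalizer"]}}'
kubectl get job rke-coredns-addon-deploy-job -n kube-system -o jsonpath='{.metadata.finalizers}'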

@jakefhyde
Contributor

Root cause

If a job could not be successfully deleted, rke would fail to delete it on the first pass. On subsequent passes, if the job was still being deleted and the old job had completed, the new job would be applied; however, the job being deleted is effectively read-only, so it would continue using the previous job's spec.

What was fixed, or what changes have occurred

Wait for successful removal of the previous job before creating a new one.
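
Conceptually (this is not the actual rke code path, just the intended ordering expressed as kubectl commands against the downstream cluster):

# Remove the old deploy job and wait for it to be fully gone...
kubectl delete job rke-coredns-addon-deploy-job -n kube-system --wait=false
kubectl wait --for=delete job/rke-coredns-addon-deploy-job -n kube-system --timeout=300s
# ...and only then create the new deploy job for the upgrade.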

Areas or cases that should be tested

  • All possible addon upgrades
  • Lowering the addon_job_timeout within the cluster spec to 0 should eventually reconcile, although the cluster may temporarily enter an error state assuming the addons can be removed successfully.
  • User addons

What areas could experience regressions?

N/A

Are the repro steps accurate/minimal?

Addon job removal failure can be tested by first provisioning a cluster, applying a finalizer to the job, and then attempting to upgrade the Kubernetes version (assuming the job is different; in the case of coredns it always is).

The following command can be used to add the finalizer to the coredns job from within a kubectl shell in the downstream cluster:

kubectl patch job rke-coredns-addon-deploy-job -n kube-system -p '{"metadata":{"finalizers":["cattle.io/do-not-delete"]}}' --type=merge

@sgapanovich

Test Environment:

Rancher version: v2.6-head 1623a55
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: single node ec2 node driver
Downstream K8s version:
  • on 2.6-head (fresh install):
    • v1.18.20 upgraded to v1.19.16 (with coredns addon)
    • v1.20.15 upgraded to v1.21.9 (with network addon and coredns addon)
  • on 2.6-head (upgrade from 2.6.3):
    • v1.18.16 upgraded to v1.19.16 (with network addon and coredns addon)


Testing:

  1. update the coredns job (addon): add finalizers: cattle.io/do-not-delete to the metadata (you can do it with kubectl edit job rke-coredns-addon-deploy-job -n kube-system)
  2. upgrade the cluster
  3. at some point kubectl get events shows that everything is ready
  4. cluster goes into error status and keeps trying to delete the addon
  5. user can find all the errors in the Rancher logs (the same ones shown in the UI), e.g. for coredns
  6. cluster stays in the error state (switching between "error" and "updating" while trying to reconcile)
  7. remove the finalizer from the job's YAML (see the patch example after this list)
  8. depending on how soon the YAML was updated, the job is deleted and a new one is created (the cluster tries to reconcile every few minutes at first and then every 10 minutes)
  9. cluster goes into active status
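
A sketch of step 7 as a single patch, assuming the same coredns job name (equivalent to deleting the finalizer entry in the YAML):

# Clear the finalizers so the stuck job can actually be deleted and re-created:
kubectl patch job rke-coredns-addon-deploy-job -n kube-system \
  --type=merge -p '{"metadata":{"finalizers":[]}}'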


@snasovich
Collaborator

/forwardport v2.6.3-patch2
