
Rancher reports cluster as active even though the add-ons are not upgraded as part of k8s version upgrade #35750

Closed
SheilaghM opened this issue Dec 4, 2021 · 6 comments
Assignees
Labels
area/provisioning-rke1 (Provisioning issues with RKE1), internal, team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
Milestone

Comments

@SheilaghM

SURE-3541

Rancher Server Setup

  • Rancher version:
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version:
  • Cluster Type (Local/Downstream):
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

Issue description:
Customer uses Rancher API call to determine the cluster upgrade state.

curl -s -k -u "${CATTLE_ACCESS_KEY}:${CATTLE_SECRET_KEY}" \
  -X GET \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  https://RANCHER_FQDN/v3/clusters/c-xxxx | jq .state
However, the state changes to active even before the add-on jobs complete; in particular, critical components such as coredns and the CNI had not finished upgrading. This is not expected behavior, since timed-out add-on jobs leave a partially upgraded cluster. (Affected cluster ID: c-w56h9)
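
A minimal polling sketch along these lines (the RANCHER_FQDN placeholder and cluster ID c-xxxx are illustrative, as in the call above):

# Poll the cluster state via the same /v3/clusters endpoint until it reports active.
while true; do
  state=$(curl -s -k -u "${CATTLE_ACCESS_KEY}:${CATTLE_SECRET_KEY}" \
    -H 'Accept: application/json' \
    https://RANCHER_FQDN/v3/clusters/c-xxxx | jq -r .state)
  echo "cluster state: ${state}"
  [ "${state}" = "active" ] && break
  sleep 30
done

Per this issue, reaching "active" does not currently guarantee that the add-on jobs have completed.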

Business impact:
Partially upgraded cluster.

Troubleshooting steps:
They had a situation where the add-on jobs timed out and none of the add-ons were upgraded.

The old add-on jobs were stuck in the removing state in the UI.
They took the following steps to complete the upgrade:

  • Deleted the add-on jobs that were stuck in the removing state
  • Triggered a reconciliation by changing the add-on job timeout
Repro steps:

Old kubelet logs were not available, so it is not clear how the add-on jobs got stuck in the removing state.
Workaround:
Is a workaround available and implemented? No
What is the workaround:

Actual behavior:
Cluster state changed to "Active" even before the add-on jobs completed.

Expected behavior:
Cluster state should stay in the "Upgrading" state until all add-on jobs are completed.

Files, logs, traces:

Additional notes:
Logs are attached to SURE ticket.

@SheilaghM added the team/hostbusters label on Dec 4, 2021
@snasovich added this to the v2.6.4 - Triaged milestone on Dec 21, 2021
@zube bot removed the [zube]: Next Up label on Dec 29, 2021
@snasovich
Collaborator

snasovich commented Jan 4, 2022

The issue happens when the previous *-deploy-job cannot be deleted for some reason during the upgrade process. A simple way to reproduce it is to manually put a dummy finalizer on some/each of these jobs.

Also, it's important to note that addon deploy jobs are created during the upgrade and NOT cleaned up during the same upgrade, but are kept. Then, when the next upgrade is executed, these jobs are supposed to be deleted prior to creating the ones needed for this new upgrade (with the same name).

So, in the case where jobs cannot get deleted for some reason, the first pass of the upgrade gets stuck when attempting to delete the first of these jobs and eventually moves on to the second upgrade attempt. The RKE log would indicate something like this:

10:24:21 pm | [INFO ] [network] Setting up network plugin: canal
10:24:21 pm | [INFO ] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
10:24:22 pm | [INFO ] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
10:24:22 pm | [INFO ] [addons] Executing deploy job rke-network-plugin
10:26:29 pm | [INFO ] Initiating Kubernetes cluster

At the next upgrade attempt, that same job appears to go through and the process gets stuck on the next one (however, in reality this first job would not run):

10:26:49 pm | [INFO ] [network] Setting up network plugin: canal
10:26:49 pm | [INFO ] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
10:26:50 pm | [INFO ] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
10:26:50 pm | [INFO ] [addons] Executing deploy job rke-network-plugin
10:26:50 pm | [INFO ] [addons] Setting up coredns
10:26:50 pm | [INFO ] [addons] Saving ConfigMap for addon rke-coredns-addon to Kubernetes
10:26:50 pm | [INFO ] [addons] Successfully saved ConfigMap for addon rke-coredns-addon to Kubernetes
10:26:50 pm | [INFO ] [addons] Executing deploy job rke-coredns-addon
10:29:27 pm | [INFO ] Initiating Kubernetes cluster

After enough iterations to go through all the jobs like this, the k8s upgrade succeeds. However, none of the jobs actually complete so addons are not upgraded.
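
For reference, these leftover deploy jobs (none of which actually ran for the new version) can be listed in the downstream cluster; a small sketch, assuming the standard RKE1 *-deploy-job naming:

# List the addon deploy jobs left over from the previous upgrade / initial provisioning:
kubectl get jobs -n kube-system | grep -- -deploy-job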

It looks like the code below doesn't try to delete an existing job that was already processed on a previous run (likely addonUpdated ends up being false). So when the following lines try to create a new job, we hit the "already exists" case and simply proceed, assuming it was properly created before, when in fact we still have a leftover from the previous upgrade / initial provisioning:
https://github.com/rancher/rke/blob/54dc689ac040076881a5b4af50326d5d15efb2c6/k8s/job.go#L39-L51
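
One way to confirm that the "already exists" job is such a leftover (and possibly stuck in deletion) is to inspect its timestamps and finalizers; a sketch using the rke-network-plugin job from the log above:

# An old creationTimestamp and/or a set deletionTimestamp indicate a leftover job stuck in deletion:
kubectl get job rke-network-plugin-deploy-job -n kube-system \
  -o jsonpath='{.metadata.creationTimestamp}{" "}{.metadata.deletionTimestamp}{" "}{.metadata.finalizers}{"\n"}'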

Thank you @kinarashah @jakefhyde for working on investigating this!

@snasovich
Collaborator

/backport v2.5.12

@slickwarren
Contributor

also reproduced this on 2.5.9 with the following steps:

  • deploy a cluster on k8s 1.19.15-rancher1-1
  • on the coredns job after the cluster gets to an active state, edit the job and add a dummy finalizer
    • kubectl edit -n kube-system job rke-coredns-addon-deploy-job
    • add finalizers: with a - dummy/qafinalizer entry under metadata (see the patch equivalent after this list)
    • validate that the finalizer was added
  • upgrade the cluster to k8s 1.20.xx
  • observe that the coredns job fails to upgrade (in system project -> kube-system namespace)
  • observe that in spite of the failed job upgrade, the cluster still gets to an active state
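
The same finalizer can also be added non-interactively; a sketch using the dummy finalizer from the steps above and the coredns job name mentioned later in this thread:

# Add the dummy finalizer to the coredns addon deploy job, then verify it is set:
kubectl patch job rke-coredns-addon-deploy-job -n kube-system \
  --type=merge -p '{"metadata":{"finalizers":["dummy/qafinalizer"]}}'
kubectl get job rke-coredns-addon-deploy-job -n kube-system -o jsonpath='{.metadata.finalizers}'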

@jakefhyde
Contributor

Root cause

If a job could not be successfully deleted, rke would fail to delete it on the first pass. On subsequent passes, if the job was still being deleted and the old job had completed, the new job would be applied; however, the job being deleted is effectively read-only, so it would continue using the previous job's spec.

What was fixed, or what changes have occurred

Wait for successful removal of the previous job before creating a new one.
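
Conceptually (this is not the actual rke code path, just the intended ordering expressed as kubectl commands against the downstream cluster):

# Remove the old deploy job and wait for it to be fully gone...
kubectl delete job rke-coredns-addon-deploy-job -n kube-system --wait=false
kubectl wait --for=delete job/rke-coredns-addon-deploy-job -n kube-system --timeout=300s
# ...and only then create the new deploy job for the upgrade.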

Areas or cases that should be tested

  • All possible addon upgrades
  • Lowering the addon_job_timeout within the cluster spec to 0 should eventually reconcile, although the cluster may temporarily enter an error state assuming the addons can be removed successfully.
  • User addons

What areas could experience regressions?

N/A

Are the repro steps accurate/minimal?

Addon job removal failure can be tested by first provisioning a cluster, applying a finalizer to the job, and then attempting to upgrade the Kubernetes version (assuming the job is different; in the case of coredns it always is).

The following command can be used to add the finalizer to the coredns job from within a kubectl shell in the downstream cluster:

kubectl patch job rke-coredns-addon-deploy-job -n kube-system -p '{"metadata":{"finalizers":["cattle.io/do-not-delete"]}}' --type=merge

@sgapanovich

Test Environment:

Rancher version: v2.6-head 1623a55
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: single node ec2 node driver
Downstream K8s version:
  • on 2.6-head (fresh install):
    • v1.18.20 upgraded to v1.19.16 (with coredns addon)
    • v1.20.15 upgraded to v1.21.9 (with network addon and coredns addon)
  • on 2.6-head (upgrade from 2.6.3):
    • v1.18.16 upgraded to v1.19.16 (with network addon and coredns addon)


Testing:

  1. update the coredns job (addon): add finalizers: cattle.io/do-not-delete to the metadata (you can do it with kubectl edit job rke-coredns-addon-deploy-job -n kube-system)
  2. upgrade the cluster
  3. at some point kubectl get events shows that everything is ready
  4. cluster goes into error status and keeps trying to delete the addon
  5. user can find all the errors in the Rancher logs (the same ones shown in the UI), e.g. for coredns
  6. cluster stays in the error state (switching between "error" and "updating" while trying to reconcile)
  7. remove the finalizer from the job's YAML (see the patch example after this list)
  8. depending on how soon the YAML was updated, the job is deleted and a new one is created (the cluster tries to reconcile every few minutes at first and then every 10 minutes)
  9. cluster goes into active status
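
A sketch of step 7 as a single patch, assuming the same coredns job name (equivalent to deleting the finalizer entry in the YAML):

# Clear the finalizers so the stuck job can actually be deleted and re-created:
kubectl patch job rke-coredns-addon-deploy-job -n kube-system \
  --type=merge -p '{"metadata":{"finalizers":[]}}'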


@snasovich
Collaborator

/forwardport v2.6.3-patch2
