Rancher reports cluster as active even though the add-ons are not upgraded as part of k8s version upgrade #35750
The issue happens when the addon deploy jobs from the previous upgrade cannot be deleted. It's important to note that addon deploy jobs are created during an upgrade and are NOT cleaned up as part of that same upgrade; they are kept. When the next upgrade is executed, these jobs are supposed to be deleted before the ones needed for the new upgrade (which have the same names) are created. So, if the old jobs cannot be deleted for some reason, the first upgrade pass gets stuck attempting to delete the first of these jobs and eventually moves on to a second upgrade attempt. The RKE log would indicate something like this:
At the next upgrade attempt, that same job would appear to go through, and the process gets stuck on the next one (though in reality this first job would not actually run).
After enough iterations to cycle through all the jobs like this, the k8s upgrade succeeds. However, none of the jobs actually complete, so the addons are not upgraded. It looks like the code doesn't try to delete an existing job that was already processed on a previous run.
Thank you @kinarashah @jakefhyde for working on investigating this!
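Jobs stuck in this state can be spotted from a kubectl shell against the downstream cluster. A minimal diagnostic sketch, assuming RKE's usual addon job naming (e.g. rke-coredns-addon-deploy-job in kube-system; names may differ per setup):

```shell
# List addon deploy jobs and flag any that are stuck deleting.
# A non-empty deletionTimestamp on a job that is still present usually
# means a finalizer is blocking its removal.
kubectl get jobs -n kube-system \
  -o custom-columns='NAME:.metadata.name,DELETING-SINCE:.metadata.deletionTimestamp'
```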
/backport v2.5.12
Also reproduced this on 2.5.9 with the following steps:
Root cause

If a job could not be successfully deleted, RKE would fail to delete it on the first pass. On subsequent passes, if the job was still being deleted and the old job was completed, the new job would be applied; however, the deleting job would be in a read-only state and continue using the previous job's spec.

What was fixed, or what changes have occurred

Wait for successful removal of the previous job before creating a new one.

Areas or cases that should be tested
What areas could experience regressions?

N/A

Are the repro steps accurate/minimal?

Addon job removal failure can be tested by first provisioning a cluster, applying a finalizer to the job, and then attempting to upgrade the Kubernetes version (assuming the job is different, which in the case of coredns it always is). The following command can be used to add the finalizer to the coredns job from within a kubectl shell in the downstream cluster:
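The exact command is not preserved in this excerpt; a plausible equivalent, assuming the coredns addon job is named rke-coredns-addon-deploy-job in kube-system and using an arbitrary placeholder finalizer name, would be:

```shell
# Add a dummy finalizer so the job cannot be fully deleted, simulating
# the stuck-removal condition (the finalizer name is a placeholder).
kubectl patch job rke-coredns-addon-deploy-job -n kube-system \
  --type=merge -p '{"metadata":{"finalizers":["example.com/block-deletion"]}}'
```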
Test Environment:

Rancher version: v2.6-head 1623a55
Downstream cluster type: single node ec2 node driver

Testing:
/forwardport v2.6.3-patch2
SURE-3541
Rancher Server Setup
Information about the Cluster
Issue description:
Customer uses a Rancher API call to determine the cluster upgrade state.
curl -s -k -u "${CATTLE_ACCESS_KEY}:${CATTLE_SECRET_KEY}" \
  -X GET \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  https://RANCHER_FQDN/v3/clusters/c-xxxx | jq .state
But the state changes to active even before the add-on jobs complete; in particular, critical components like coredns and the CNI were not upgraded. This is not expected behavior, since timed-out add-on jobs leave a partially upgraded cluster. (Affected cluster ID: c-w56h9)
Business impact:
Partially upgraded cluster.
Troubleshooting steps:
They had a situation where the add-on jobs timed out and none of the add-ons were upgraded.
The old add-on jobs were in removing state in the UI.
They took the steps below to complete the upgrade:
Deleted the removing add-on jobs
Triggered a reconciliation by changing the job timeout
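For the first of those steps, jobs stuck in the removing state typically need their finalizers cleared before the pending deletion can complete. A sketch of that cleanup, assuming the same addon job name used above (adjust per cluster):

```shell
# Clear finalizers so the pending deletion can complete; the next
# reconciliation can then recreate the job cleanly.
kubectl patch job rke-coredns-addon-deploy-job -n kube-system \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```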
Repro steps:
Old kubelet logs were not available, so it is not clear how the add-on jobs became stuck in the removing state.
Workaround:
Is a workaround available and implemented? No
What is the workaround:
Actual behavior:
Cluster state changed to "Active" even before the add-on jobs completed
Expected behavior:
Cluster state should stay in the "Upgrading" state until all add-on jobs are completed.
Files, logs, traces:
Additional notes:
Logs are attached to SURE ticket.