
Upgrading from k8s 1.23.14 to 1.23.15 fails in Rancher 2.6.9 #40280

Closed
wargamez opened this issue Jan 24, 2023 · 17 comments
Labels: kind/bug, team/hostbusters, team/infracloud


wargamez commented Jan 24, 2023

In Cluster Manager I tried to upgrade to Kubernetes 1.23.15 from 1.23.14, but the process ends with "upgrade failed" and an error message (see screenshots). However, when it says failed, I hit Edit Cluster again and, to my surprise, 1.23.14 is still a selectable option. If I choose that and hit Upgrade (downgrade) again, the cluster becomes green again. However, the cluster info now says 1.23.15 and all nodes show 1.23.15. Is this a known bug?

[Screenshots: rancherprobl1, rancherprobl2, rancherprobl3]

wargamez added the kind/bug label Jan 24, 2023

wargamez commented Jan 24, 2023

It seems the worker nodes are stuck at 1.23.14, but the control-plane and etcd nodes are upgraded to 1.23.15...


gestgithub commented Jan 30, 2023

I can confirm that I also have the same issue. I tried several times with both Rancher 2.6.9 and 2.7.0 to upgrade to 1.23.15 as well as 1.24.9.

It seems that under "Workload" --> "Jobs" in the kube-system namespace you can see that the job rke-network-plugin-deploy-job keeps failing with the messages mentioned in this issue: projectcalico/calico#6258
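
For anyone checking the same thing from the command line, the failing job's output can also be pulled with kubectl (a minimal check, assuming the default kube-system job name above):

kubectl -n kube-system logs job/rke-network-plugin-deploy-job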

However, not all of our clusters have this issue. All of them but one did the upgrade fine.


mat1010 commented Jan 31, 2023

We can confirm the behaviour while upgrading from 1.24.4 to 1.24.9 with Rancher 2.7.1.

@jimliming

Confirmed that editing these CRDs allowed the update to complete successfully with RKE 1.4.2 & k8s 1.21.5 => 1.24.9:

rancher/kontainer-driver-metadata@608a5ff

@niusmallnan

I am trying to figure out this problem. Considering that it may be caused by the upgrade of the CNI components, I simulated the version upgrade by adjusting KDM (https://github.com/rancher/kontainer-driver-metadata).

Rancher: v2.6.10
KDM 1: https://raw.githubusercontent.com/rancher/kontainer-driver-metadata/08-30-2022/data/data.json
KDM 2: https://releases.rancher.com/kontainer-driver-metadata/release-v2.6/data.json

Cases (From / To / Result):

  • Case A: from KDM 1 (RKE 1.24.4-rancher1-1, Calico v3.22.0) to KDM 2 (RKE 1.24.9-rancher1-1, Calico v3.22.5): Success
  • Case B: from KDM 1 (RKE 1.22.13-rancher1-1, Calico v3.21.1) to KDM 2 (RKE 1.22.17-rancher1-1, Calico v3.22.5): Success
  • Case C: from KDM 2 (RKE v1.21.14-rancher1-1, Calico v3.19.2) to KDM 2 (RKE 1.22.17-rancher1-1, Calico v3.22.5): Success
  • Case D: from KDM 2 (RKE v1.20.15-rancher2-2, Calico v3.17.2) to KDM 2 (RKE 1.22.17-rancher1-1, Calico v3.22.5): Success

I haven't found a way to reproduce it yet.

Is the CNI using the default configuration, or has it been changed since the initial installation? Maybe this is a clue?


rootwuj commented Feb 1, 2023

I tried to follow these steps to test but did not reproduce the issue. The cluster can be upgraded.

Rancher: v2.6.9

KDM1: https://raw.githubusercontent.com/rancher/kontainer-driver-metadata/11-28-2022/data/data.json
KDM2: https://releases.rancher.com/kontainer-driver-metadata/release-v2.6/data.json

Steps:

  1. Install v2.6.9.
  2. Update settings -> rke-metadata-config URL to KDM1.
  3. Create an RKE cluster, select k8s version v1.23.14-rancher1-1 and the Calico network provider; the cluster becomes active.
  4. Update settings -> rke-metadata-config URL to KDM2.
  5. Upgrade the k8s version of the cluster to v1.23.15-rancher1-1. The cluster can be upgraded successfully.
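
(For reference, the KDM source Rancher is actually using can also be read from the Rancher local cluster. A minimal check, assuming kubectl access to the local cluster and that the setting object keeps the UI name rke-metadata-config and stores its data in a value field:)

kubectl get settings.management.cattle.io rke-metadata-config -o jsonpath='{.value}'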


fengxx commented Feb 1, 2023

There are RKE1 clusters set up two years ago which used v1beta1 CRDs, where the spec.preserveUnknownFields value is set to true (v1beta1 defaults to true).
If you use a recent k8s version and want to reproduce the issue, just edit the CRD and set preserveUnknownFields to true before upgrading.

workaround:

kubectl get crd ipamblocks.crd.projectcalico.org -o yaml |sed 's#preserveUnknownFields: true#preserveUnknownFields: false#' |kubectl apply -f -
kubectl get crd felixconfigurations.crd.projectcalico.org -o yaml |sed 's#preserveUnknownFields: true#preserveUnknownFields: false#' |kubectl apply -f -
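
To confirm the change was applied before retrying the upgrade, the field can be read back (a small check against the same two CRDs; an empty result also means false, since false is the v1 default):

kubectl get crd ipamblocks.crd.projectcalico.org -o jsonpath='{.spec.preserveUnknownFields}'
kubectl get crd felixconfigurations.crd.projectcalico.org -o jsonpath='{.spec.preserveUnknownFields}'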

P.S. Upstream already added a backwards-compatible fix in projectcalico/calico#6242, but only in 3.24.


niusmallnan commented Feb 1, 2023

Using the v1beta1 CRD as a clue, I did this test.

I checked the setting of preserveUnknownFields:
kubectl get crd ipamblocks.crd.projectcalico.org -o yaml | grep preserveUnknownFields

Calico has used v1 CRDs since 3.15.

Rancher: v2.6.10
Upgrade path:

  • RKE v1.18.18-rancher1-2, Calico/Canal v3.13.4, preserveUnknownFields=true, Success
  • RKE v1.19.16-rancher2-1, Calico/Canal v3.16.5, preserveUnknownFields=true, Success
  • RKE v1.20.15-rancher2-1, Calico/Canal v3.17.2, preserveUnknownFields=true, Success
  • RKE v1.21.14-rancher1-1, Calico/Canal v3.19.2, preserveUnknownFields=true, Success
  • RKE v1.22.17-rancher1-1, Calico/Canal v3.22.5, preserveUnknownFields=true, Fail

When the upgrade fails, I can see that the rke-network-plugin-deploy-job fails to run. It shows some error logs:

for: "/etc/config/rke-network-plugin.yaml": customresourcedefinitions.apiextensions.k8s.io "felixconfigurations.crd.projectcalico.org" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
for: "/etc/config/rke-network-plugin.yaml": customresourcedefinitions.apiextensions.k8s.io "ipamblocks.crd.projectcalico.org" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

I ran the following commands, edited the cluster in the UI, and clicked the Save button, and the cluster upgraded successfully.

kubectl get crd ipamblocks.crd.projectcalico.org -o yaml |sed 's#preserveUnknownFields: true#preserveUnknownFields: false#' |kubectl apply -f -
kubectl get crd felixconfigurations.crd.projectcalico.org -o yaml |sed 's#preserveUnknownFields: true#preserveUnknownFields: false#' |kubectl apply -f -
kubectl annotate crd felixconfigurations.crd.projectcalico.org kubectl.kubernetes.io/last-applied-configuration-
kubectl annotate crd ipamblocks.crd.projectcalico.org kubectl.kubernetes.io/last-applied-configuration-
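
An equivalent way to apply the same change without round-tripping the YAML through sed would be a merge patch over the same two CRDs (a sketch, not tested here):

for crd in ipamblocks.crd.projectcalico.org felixconfigurations.crd.projectcalico.org; do
  # set preserveUnknownFields to false (valid for apiextensions.k8s.io/v1 CRDs)
  kubectl patch crd "$crd" --type=merge -p '{"spec":{"preserveUnknownFields":false}}'
  # drop the stale last-applied-configuration annotation, as in the commands above
  kubectl annotate crd "$crd" kubectl.kubernetes.io/last-applied-configuration-
done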

The Calico v3.22.5 version may have some special changes; from the information provided, all RKE clusters that fail to upgrade use Calico v3.22.5. In any case, there should be a solution here.

If anyone finds other errors in rke-network-plugin-deploy-job, please provide clues. I can update the workaround.


fengxx commented Feb 1, 2023

The Calico 3.22 CRDs introduced new fields with default values, so preserveUnknownFields has to be set to false. Please help review the PR rancher/kontainer-driver-metadata#1069.


pkhamre commented Feb 2, 2023

I run the following command, edit the cluster on the UI, and click the save button to upgrade the cluster successfully.

kubectl get crd ipamblocks.crd.projectcalico.org -o yaml |sed 's#preserveUnknownFields: true#preserveUnknownFields: false#' |kubectl apply -f -
kubectl get crd felixconfigurations.crd.projectcalico.org -o yaml |sed 's#preserveUnknownFields: true#preserveUnknownFields: false#' |kubectl apply -f -
kubectl annotate crd felixconfigurations.crd.projectcalico.org kubectl.kubernetes.io/last-applied-configuration-
kubectl annotate crd ipamblocks.crd.projectcalico.org kubectl.kubernetes.io/last-applied-configuration-

I can confirm this workaround worked successfully on our Rancher environment, thanks.

Sahota1225 added the team/hostbusters label Feb 2, 2023
@gregfurman

We used the workaround provided by @niusmallnan to go from K8s v1.22.9 and Rancher v2.6.5 to K8s v1.24.9 and Rancher v2.7.1.


wargamez commented Feb 4, 2023

I can confirm that this worked as well. Thank you so much!

@snasovich

/forwardport v2.7.2


jloisel commented Feb 10, 2023

Same issue here, upgrading a cluster from v1.23.14 to v1.23.15 on Rancher 2.6.10. Can confirm we have an RKE1 cluster from 3 years ago.

@rayandas

Steps to validate:

  • Set up Rancher v2.6.5 using Docker.
  • Create a v1.18.20-rancher1 cluster.
  • Upgrade to Rancher v2.6.10 using these steps.
  • Upgrade the existing v1.18.20-rancher1 cluster to v1.23.16-rancher1 OR v1.24.10-rancher1.

The job rke-network-plugin-deploy-job shouldn't fail and calico/canal pods should be running.
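
A quick way to check both (a sketch; the canal label selector is an assumption and may differ depending on the CNI in use):

kubectl -n kube-system get job rke-network-plugin-deploy-job
kubectl -n kube-system get pods -l k8s-app=canal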

cc: @rishabhmsra

@rishabhmsra

Validated using the steps below:

  • On Rancher v2.6.5, provisioned a k8s v1.18.20-rancher1-1 EC2 node driver cluster (1 control plane, 1 etcd, 1 worker).
  • Upgraded the Rancher server to version v2.6.10.
  • Upgraded the existing k8s cluster to version v1.23.16-rancher1-1.
  • Validated the rke-network-plugin-deploy-job and pods.

Result:

  • Cluster upgraded successfully to v1.23.16-rancher1-1 and rke-network-plugin-deploy-job is in Completed state with canal pods up and running:

[Screenshots: canal pods running; rke-network-plugin-deploy-job completed]


rishabhmsra commented Feb 20, 2023

Validated again using the steps below:

  • On Rancher v2.6.5, provisioned a k8s v1.18.20-rancher1-3 EC2 node driver cluster (1 control plane, 1 etcd, 1 worker).
  • Upgraded the Rancher server to v2.6-head (5d91d1a) successfully, with KDM pointing to dev-v2.6.
  • Upgraded the existing k8s cluster to version v1.23.16-rancher2-1.
  • Validated the rke-network-plugin-deploy-job and pods.

Result:

  • Cluster upgraded successfully to v1.23.16-rancher2-1 and rke-network-plugin-deploy-job is in Completed state, with all pods, including the canal pods, up and running:

[Screenshots: canal pods running; rke-network-plugin-deploy-job pod completed]
