
[BUG] using 1.27.14 after upgrading rancher is not consistently working for rke2/k3s clusters #45704

Closed
slickwarren opened this issue Jun 6, 2024 · 1 comment
Labels
area/provisioning-v2, area/rke2, kind/bug-qa, status/release-blocker, team/rke2

Comments


slickwarren commented Jun 6, 2024

Rancher Server Setup

  • Rancher version: latest security version
  • Installation option (Docker install/Helm Chart): helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): rke1
  • Proxy/Cert Details: valid letsencrypt certs

Information about the Cluster

  • Kubernetes version: v1.27.14
  • Cluster Type (Local/Downstream): downstream, rke2 / k3s
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): aws provisioned

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    tested with a standard user (cluster owner) and an admin user

Describe the bug

While doing release testing, I'm seeing an issue when upgrading to 1.27.14 on RKE2/K3s, but only for clusters that existed before upgrading Rancher from 2.8.4 -> security test version.

To Reproduce

  • provision Rancher v2.8.4
    • set up downstream RKE2 / K3s clusters on 1.26 or 1.27
  • upgrade Rancher
    • wait for any downstream upgrades to complete
  • upgrade the k8s version of the downstream clusters to 1.27.14 (see the sketch after this list)
    • wait for the clusters to finish upgrading
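
For reference, the k8s upgrade step can also be driven outside the UI by bumping the cluster's spec.kubernetesVersion. Below is only a minimal sketch using the Kubernetes Python client, assuming the provisioning-v2 cluster object lives in the fleet-default namespace of the Rancher local cluster; the cluster name and version string are placeholders, not values from this issue:

```python
# Minimal sketch: bump an RKE2/K3s downstream cluster's Kubernetes version by
# patching the provisioning.cattle.io/v1 Cluster object in the local cluster.
# Cluster name, namespace, and version string below are placeholders/assumptions.
from kubernetes import client, config

CLUSTER_NAME = "my-downstream"       # placeholder downstream cluster name
TARGET_VERSION = "v1.27.14+rke2r1"   # placeholder RKE2 version string


def upgrade_downstream_cluster():
    # Uses the kubeconfig of the Rancher local (management) cluster.
    config.load_kube_config()
    api = client.CustomObjectsApi()

    # Provisioning-v2 clusters are namespaced custom objects, normally in fleet-default.
    api.patch_namespaced_custom_object(
        group="provisioning.cattle.io",
        version="v1",
        namespace="fleet-default",
        plural="clusters",
        name=CLUSTER_NAME,
        body={"spec": {"kubernetesVersion": TARGET_VERSION}},
    )


if __name__ == "__main__":
    upgrade_downstream_cluster()
```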

Result

  • one cluster (1.27.13 -> 1.27.14, RKE2), done via automation, is a fresh 1.27.14 cluster. 1/3 etcd nodes got stuck in an unavailable state. I wasn't able to view anything on the cluster in this state, so I deleted the node and waited for a new one to register, but it is now stuck in a new state, waiting for probes: etcd.
    • I can now view the cluster, and looking at the cattle-agent logs I see: level=error msg="Unknown error: Get "https://devrel.shipa.io/hs-fs/hubfs/SHIPA2-large.png\": remote error: tls: handshake failure" along with some other errors; all of the latest ones seem related to the issue with 502 errors from the charts repo that I thought was resolved yesterday?
    • I think it stopped posting logs after that, so there's nothing new in cattle-agent since last night, even after trying to delete the 'stuck' node and add a new one.
    • note: another fresh install of the same type of cluster was successful.
  • another cluster (1.26.15 -> 1.27.14, K3s) that I upgraded manually; it was provisioned on Rancher 2.8.4 and survived the Rancher upgrade to the security version:
    • upon upgrading the k8s version, the nodes were deleted one at a time, from what I could tell, until all nodes were replaced, but some nodes got stuck in an unavailable state.
      The cluster has a control plane node stuck in unavailable along with one worker, so they aren't deleting and the cluster seems stuck in this state. The cattle-cluster-agent pods aren't in the same state as on the other cluster, but one of the pods is stuck terminating. The metrics-server pod is also stuck terminating, which I think is just because it's trying to scrape the unavailable nodes. So I haven't figured out why the nodes are not being removed when they should have been (a triage sketch follows this list).
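
For triage of the stuck nodes described above, node readiness and the cattle-cluster-agent logs can be pulled straight from the downstream cluster. This is only a minimal sketch with the Kubernetes Python client; the cattle-system namespace and the app=cattle-cluster-agent label selector are assumptions based on a default Rancher agent deployment, and the kubeconfig is assumed to point at the downstream cluster:

```python
# Minimal triage sketch: print node readiness and tail cattle-cluster-agent logs
# on a downstream cluster. Assumes a kubeconfig pointing at the downstream cluster
# and a default Rancher agent deployment in cattle-system.
from kubernetes import client, config


def dump_cluster_state():
    config.load_kube_config()   # downstream cluster kubeconfig
    core = client.CoreV1Api()

    # Node conditions: flag anything that is not Ready=True (the "unavailable" nodes).
    for node in core.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
        )
        print(f"node={node.metadata.name} ready={ready}")

    # Recent cattle-cluster-agent logs (label selector is an assumption).
    pods = core.list_namespaced_pod(
        "cattle-system", label_selector="app=cattle-cluster-agent"
    )
    for pod in pods.items:
        print(f"--- logs from {pod.metadata.name} ---")
        print(core.read_namespaced_pod_log(
            pod.metadata.name, "cattle-system", tail_lines=50
        ))


if __name__ == "__main__":
    dump_cluster_state()
```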

Expected Result

Clusters should be able to upgrade after Rancher is upgraded.

Screenshots

Additional context

  • rke1 seemed to upgrade to this version fine
  • custom clusters seemed to upgrade fine for k3s/rke2
  • a new cluster created after the upgrade of rancher seemed to upgrade fine using rke2 (1.27.13 -> 1.27.14)
  • a new cluster created on 1.27.14 also seems to work fine
slickwarren added the kind/bug-qa, area/rke2, area/provisioning-v2, team/rke2, and status/release-blocker labels on Jun 6, 2024
slickwarren (author) commented:

this turned out to be an issue with automation - closing this issue.
