
[BUG] using 1.27.14 after upgrading rancher is not consistently working for rke2/k3s clusters #45704

Closed
slickwarren opened this issue Jun 6, 2024 · 1 comment
Labels
area/provisioning-v2, area/rke2, kind/bug-qa, status/release-blocker, team/rke2

Comments


slickwarren commented Jun 6, 2024

Rancher Server Setup

  • Rancher version: latest security version
  • Installation option (Docker install/Helm Chart): helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): rke1
  • Proxy/Cert Details: valid letsencrypt certs

Information about the Cluster

  • Kubernetes version: v1.27.14
  • Cluster Type (Local/Downstream): downstream, rke2 / k3s
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): aws provisioned

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    tested with a standard user (cluster owner) and an admin user

Describe the bug

While doing release testing, I'm seeing an issue when upgrading to 1.27.14 on RKE2/K3s, but only for clusters that existed before upgrading Rancher from 2.8.4 -> security test version.

To Reproduce

  • provision Rancher v2.8.4
    • set up downstream RKE2 / K3s clusters on 1.26 or 1.27
  • upgrade Rancher
    • wait for any downstream upgrades to complete
  • upgrade the k8s version of the downstream clusters to 1.27.14 (see the sketch after this list)
    • wait for the clusters to finish upgrading
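
For reference, the k8s upgrade step can also be driven outside the UI by bumping the cluster's spec.kubernetesVersion. Below is only a minimal sketch using the Kubernetes Python client, assuming the provisioning-v2 cluster object lives in the fleet-default namespace of the Rancher local cluster; the cluster name and version string are placeholders, not values from this issue:

```python
# Minimal sketch: bump an RKE2/K3s downstream cluster's Kubernetes version by
# patching the provisioning.cattle.io/v1 Cluster object in the local cluster.
# Cluster name, namespace, and version string below are placeholders/assumptions.
from kubernetes import client, config

CLUSTER_NAME = "my-downstream"       # placeholder downstream cluster name
TARGET_VERSION = "v1.27.14+rke2r1"   # placeholder RKE2 version string


def upgrade_downstream_cluster():
    # Uses the kubeconfig of the Rancher local (management) cluster.
    config.load_kube_config()
    api = client.CustomObjectsApi()

    # Provisioning-v2 clusters are namespaced custom objects, normally in fleet-default.
    api.patch_namespaced_custom_object(
        group="provisioning.cattle.io",
        version="v1",
        namespace="fleet-default",
        plural="clusters",
        name=CLUSTER_NAME,
        body={"spec": {"kubernetesVersion": TARGET_VERSION}},
    )


if __name__ == "__main__":
    upgrade_downstream_cluster()
```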

Result

  • one cluster (1.27.13 -> 1.27.14, RKE2), done via automation, is a fresh 1.27.14 cluster. 1/3 etcd nodes got stuck in an unavailable state. I wasn't able to view anything on the cluster in this state, so I deleted the node and waited for a new one to register, but it is now stuck in a new state, waiting for probes: etcd.
    • I can now view the cluster, and looking at the cattle-agent logs I see: level=error msg="Unknown error: Get "https://devrel.shipa.io/hs-fs/hubfs/SHIPA2-large.png\": remote error: tls: handshake failure" along with some other errors; all of the latest ones seem related to the issue with 502 errors from the charts repo that I thought was resolved yesterday?
    • I think it stopped posting logs after that, so there's nothing new in cattle-agent since last night, even after trying to delete the 'stuck' node and add a new one.
    • note: another fresh install of the same type of cluster was successful.
  • another cluster (1.26.15 -> 1.27.14, K3s) that I upgraded manually; it was provisioned on Rancher 2.8.4 and survived the Rancher upgrade to the security version:
    • upon upgrading the k8s version, the nodes were deleted one at a time, from what I could tell, until all nodes were replaced, but some nodes got stuck in an unavailable state.
      The cluster has a control plane node stuck in unavailable along with one worker, so they aren't deleting and the cluster seems stuck in this state. The cattle-cluster-agent pods aren't in the same state as on the other cluster, but one of the pods is stuck terminating. The metrics-server pod is also stuck terminating, which I think is just because it's trying to scrape the unavailable nodes. So I haven't figured out why the nodes are not being removed when they should have been (a triage sketch follows this list).
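
For triage of the stuck nodes described above, node readiness and the cattle-cluster-agent logs can be pulled straight from the downstream cluster. This is only a minimal sketch with the Kubernetes Python client; the cattle-system namespace and the app=cattle-cluster-agent label selector are assumptions based on a default Rancher agent deployment, and the kubeconfig is assumed to point at the downstream cluster:

```python
# Minimal triage sketch: print node readiness and tail cattle-cluster-agent logs
# on a downstream cluster. Assumes a kubeconfig pointing at the downstream cluster
# and a default Rancher agent deployment in cattle-system.
from kubernetes import client, config


def dump_cluster_state():
    config.load_kube_config()   # downstream cluster kubeconfig
    core = client.CoreV1Api()

    # Node conditions: flag anything that is not Ready=True (the "unavailable" nodes).
    for node in core.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
        )
        print(f"node={node.metadata.name} ready={ready}")

    # Recent cattle-cluster-agent logs (label selector is an assumption).
    pods = core.list_namespaced_pod(
        "cattle-system", label_selector="app=cattle-cluster-agent"
    )
    for pod in pods.items:
        print(f"--- logs from {pod.metadata.name} ---")
        print(core.read_namespaced_pod_log(
            pod.metadata.name, "cattle-system", tail_lines=50
        ))


if __name__ == "__main__":
    dump_cluster_state()
```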

Expected Result

Clusters should be able to upgrade after Rancher is upgraded.

Screenshots

Additional context

  • rke1 seemed to upgrade to this version fine
  • custom clusters seemed to upgrade fine for k3s/rke2
  • a new cluster created after the upgrade of rancher seemed to upgrade fine using rke2 (1.27.13 -> 1.27.14)
  • a new cluster created on 1.27.14 also seems to work fine
slickwarren added the kind/bug-qa, area/rke2, area/provisioning-v2, team/rke2, and status/release-blocker labels on Jun 6, 2024
slickwarren (author) commented:

this turned out to be an issue with automation - closing this issue.
