[BUG] Re-added (removed and imported) RKE2 cluster fails to upgrade #45618

Open
erkexzcx opened this issue May 29, 2024 · 0 comments
Labels
kind/bug Issues that are defects reported by users or that we know have reached a real release

Description

An RKE2 cluster (managed by Rancher) that has been removed from Rancher and re-added (imported) fails to upgrade when the upgrade is done via Rancher. The CP/etcd nodes appear to upgrade successfully (one by one), but the process gets stuck on the last one and, after a few minutes, ends up with this error:

[Screenshot of the error banner]
For search engines: [Disconnected] Cluster agent is not connected

Upgrading a non-imported cluster (one that was created by and is still managed by Rancher) works without any issues (as expected), but the upgrade fails if the cluster has been re-imported.

Looking at the /usr/local/bin/rke2 --version output on all CP/etcd/master/worker nodes (i.e. every node in the RKE2 cluster), they all report the same version after such a stuck upgrade: rke2 version v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886). So I assume the upgrade itself succeeded on all nodes and the problem is some bug within Rancher.
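
For reference, this is roughly how I checked the version on every node (SSH access assumed; the hostnames below are placeholders for my actual nodes):

```bash
# Check the installed RKE2 version on every node (hostnames are placeholders).
for node in cp1 cp2 cp3 worker1 worker2; do
  echo -n "$node: "
  ssh "$node" /usr/local/bin/rke2 --version | head -n1
done
```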

As a result, Rancher's proxy (Kubernetes API and so on) doesn't work while the cluster is stuck in the "being upgraded" state, even though the cluster itself is working perfectly fine.
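
While the cluster is stuck like this, talking to the downstream cluster directly (bypassing the Rancher proxy) shows everything healthy. A rough check, assuming a kubeconfig taken from an RKE2 server node:

```bash
# Point kubectl straight at the RKE2 cluster instead of going through Rancher
# (default RKE2 kubeconfig path on a server node).
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

kubectl get nodes -o wide                # all nodes Ready, running the new version
kubectl -n cattle-system get pods        # cattle-cluster-agent pods are Running
# Label may differ between Rancher versions:
kubectl -n cattle-system logs -l app=cattle-cluster-agent --tail=50
```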

I've also found a temporary "fix" to work around this issue:

  1. Remove the cluster from Rancher.
  2. Import the cluster back into Rancher.

However, there is a reason it's only "temporary": restart all enabled/relevant rke2-server.service/rke2-agent.service services on all RKE2 nodes (or just reboot them all) and the cluster goes right back to [Disconnected] Cluster agent is not connected. 🤷‍♂️
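
To be explicit, this is all it takes to get back into the broken state after the temporary fix (run the command matching each node's role):

```bash
# On server (CP/etcd) nodes:
sudo systemctl restart rke2-server.service

# On worker (agent) nodes:
sudo systemctl restart rke2-agent.service
```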

Note that I was not able to find any relevant error in the Rancher, RKE2 service, or pod logs. Everything seems to work just fine, with no obvious error, so I don't know how to troubleshoot it further...
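
For completeness, these are roughly the places I looked, with nothing obviously wrong in any of them:

```bash
# RKE2 service logs on the nodes (only one of the two units exists per node):
sudo journalctl -u rke2-server -u rke2-agent --since "1 hour ago" --no-pager

# Rancher server logs, on the local (management) K3s cluster:
kubectl -n cattle-system logs -l app=rancher --tail=200

# Cluster agent logs, on the downstream RKE2 cluster:
kubectl -n cattle-system logs -l app=cattle-cluster-agent --tail=200
```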

Things I've tried that didn't help with this issue:

  1. Using https://github.com/rancher/rancher-cleanup before re-importing the cluster back into Rancher (see the sketch after this list).
  2. Lots of restarts (services, upgrade-related pods, deleting upgrade-related namespaces).
  3. Enabling the legacy feature flag in the Rancher WebUI. I even rebuilt the cluster and went through all of this again, with the same outcome.
  4. Searching GitHub for related Rancher/RKE2 issues; none of them were relevant or offered a fix that worked for me.
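
For item 1, the cleanup was run roughly like this. The manifest paths and job name are taken from the rancher-cleanup README at the time of writing and may change, so treat this as a sketch:

```bash
# Run against the downstream RKE2 cluster before re-importing it into Rancher.
git clone https://github.com/rancher/rancher-cleanup.git
cd rancher-cleanup
kubectl apply -f deploy/rancher-cleanup.yaml
# Follow the cleanup job until it finishes (job name/namespace per the repo's README):
kubectl -n kube-system logs -l job-name=cleanup-job -f
# Optional verification job from the same repo:
kubectl apply -f deploy/verify.yaml
```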

Reproduction steps

Install Rancher. In my case, Rancher v2.8.4 is installed via Helm on a 3-node multi-master K3s cluster (v1.27.7+k3s2).
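
For context, the Rancher install itself was a standard Helm install, roughly like this (cert-manager already in place; the hostname is a placeholder):

```bash
# Standard Rancher install via Helm on the local K3s cluster.
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
helm repo update
helm install rancher rancher-stable/rancher \
  --namespace cattle-system --create-namespace \
  --set hostname=rancher.example.com \
  --version 2.8.4
```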

In Rancher, create a new generic cluster definition using this YAML:

apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: main
  annotations: {}
  labels: {}
  namespace: fleet-default
spec:
  defaultPodSecurityAdmissionConfigurationTemplateName: rancher-privileged
  enableNetworkPolicy: true
  kubernetesVersion: v1.26.8+rke2r1
  localClusterAuthEndpoint:
    caCerts: ''
    enabled: true
    fqdn: ''
  rkeConfig:
    chartValues:    
      rke2-cilium: {}
    etcd:
      disableSnapshots: false
      snapshotRetention: 5
      snapshotScheduleCron: 0 */5 * * *
    machineGlobalConfig:
      cni: cilium
      disable:
        - rke2-metrics-server
      disable-kube-proxy: false
      etcd-expose-metrics: false
      profile: null
    machineSelectorConfig:
      - config:
          #profile: cis-1.23
          protect-kernel-defaults: true
    registries:
      configs: {}
      mirrors: {}
    upgradeStrategy:
      controlPlaneConcurrency: '1'
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        disableEviction: false
        enabled: false
        force: false
        gracePeriod: -1
        ignoreDaemonSets: true
        skipWaitForDeleteTimeoutSeconds: 0
        timeout: 120
      workerConcurrency: '1'
      workerDrainOptions:
        deleteEmptyDirData: true
        disableEviction: false
        enabled: false
        force: false
        gracePeriod: -1
        ignoreDaemonSets: true
        skipWaitForDeleteTimeoutSeconds: 0
        timeout: 120

Then register all cluster nodes to Rancher using the Rancher-issued commands from the WebUI. I have 3 CP/etcd nodes and 2 worker nodes.
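
The Rancher-issued registration commands have roughly this shape (the real URL, token and checksum come from the cluster's Registration tab in the WebUI; the values below are placeholders):

```bash
# CP/etcd nodes:
curl -fL https://rancher.example.com/system-agent-install.sh | sudo sh -s - \
  --server https://rancher.example.com --token <token> --ca-checksum <checksum> \
  --etcd --controlplane

# Worker nodes:
curl -fL https://rancher.example.com/system-agent-install.sh | sudo sh -s - \
  --server https://rancher.example.com --token <token> --ca-checksum <checksum> \
  --worker
```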

The cluster itself works great, with no issues. Now remove it from Rancher, then use Import cluster, give it the same name main, and the registration commands will be shown. I had to update my kubeconfig to be able to connect to RKE2 directly (without using Rancher as a proxy), but the import also works perfectly fine.
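
Concretely, getting a direct kubeconfig and performing the import looked roughly like this (the import URL is generated by Rancher; the one below is a placeholder):

```bash
# /etc/rancher/rke2/rke2.yaml is the default kubeconfig on an RKE2 server node
# (readable by root); replace 127.0.0.1 with a reachable server address.
scp root@cp1:/etc/rancher/rke2/rke2.yaml ./rke2-direct.yaml
sed -i 's/127.0.0.1/<cp1-address>/' rke2-direct.yaml
export KUBECONFIG=$PWD/rke2-direct.yaml

# Apply the import manifest shown by Rancher after creating the imported cluster:
kubectl apply -f https://rancher.example.com/v3/import/<generated-token>.yaml
```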

When the cluster is imported and fully "green" in the WebUI, go to Cluster Management and upgrade it to any higher version. It can be the latest version shown in Rancher, or just one version higher than the current one. I've seen the CP/etcd nodes upgrade one by one, but the process gets stuck on the last CP/etcd node and stays there with the message "being upgraded". After several minutes, the [Disconnected] Cluster agent is not connected error appears.
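
While the last CP/etcd node sits in "being upgraded", the downstream cluster can still be inspected directly. My understanding is that Rancher drives upgrades of imported RKE2 clusters through the system-upgrade-controller, so these are roughly the objects worth looking at:

```bash
# Inspect the upgrade machinery on the downstream cluster directly
# (assumes Rancher manages the upgrade via system-upgrade-controller plans in cattle-system).
kubectl -n cattle-system get plans.upgrade.cattle.io
kubectl -n cattle-system get jobs,pods | grep -i upgrade

# Confirm every node already runs the target version:
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
```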

Conclusion

As I've stated above, the rke2 binary is the exact same version on all nodes after the upgrade, so I assume the upgrade itself is successful, but there is some nasty bug in Rancher, and I don't think I can do anything other than report this issue here.

Please let me know what additional information you would need from me to help resolve this issue.
