Description
An RKE2 cluster (managed by Rancher), when removed from Rancher and re-added (imported), fails to upgrade when the upgrade is done via Rancher. The CP/ETCD nodes appear to upgrade successfully, one by one, then the process gets stuck on the last one and, after a few minutes, ends with this error:
For search engines: [Disconnected] Cluster agent is not connected
Upgrading a non-imported cluster (one that was created and is managed by Rancher) works without any issues, as expected, but the upgrade fails if the cluster was re-imported.
Looking at the /usr/local/bin/rke2 --version output on all CP/etcd/master/worker nodes (that is, every node in the RKE2 cluster), they all report the identical version after such a stuck upgrade: rke2 version v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886). So I assume the upgrade was successful on all nodes and that the problem is a bug within Rancher.
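The version comparison described above can be scripted. The following is only a sketch: it assumes the `rke2 --version` output of each node has already been collected into a dictionary, and the node names (`cp-1`, `worker-1`) and function names are hypothetical.

```python
import re

# Matches lines such as:
#   rke2 version v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886)
VERSION_RE = re.compile(
    r"^rke2 version (?P<version>v\S+) \((?P<commit>[0-9a-f]+)\)",
    re.MULTILINE,
)


def parse_rke2_version(output: str) -> tuple[str, str]:
    """Return (version, commit) parsed from `rke2 --version` output."""
    match = VERSION_RE.search(output)
    if match is None:
        raise ValueError(f"unrecognised rke2 --version output: {output!r}")
    return match["version"], match["commit"]


def all_nodes_match(outputs: dict[str, str]) -> bool:
    """True if every node reports the same rke2 version and commit.

    `outputs` maps a (hypothetical) node name to the text printed by
    `/usr/local/bin/rke2 --version` on that node.
    """
    parsed = {parse_rke2_version(text) for text in outputs.values()}
    return len(parsed) <= 1


if __name__ == "__main__":
    line = "rke2 version v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886)"
    print(all_nodes_match({"cp-1": line, "worker-1": line}))
```

A matching result only shows that the binaries agree; it does not say anything about the state Rancher tracks for the cluster.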
As a result, Rancher's proxy (Kubernetes API and so on) doesn't work while the cluster is in the being upgraded state, even though the cluster itself is working perfectly fine.
I've also found a temporary "fix" to work around this issue:
Remove cluster from Rancher.
Import cluster back to Rancher.
However, there is a reason why it's "temporary": restart all enabled/relevant rke2-server.service/rke2-agent.service services on all RKE2 nodes (or just reboot all of them) and the cluster is back to [Disconnected] Cluster agent is not connected. 🤷‍♂️
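To make the restart step repeatable when reproducing this, the per-node commands can be generated rather than typed by hand. This is a dry-run sketch under assumptions that are not in the report: SSH access to each node, `sudo` rights, and hypothetical host names. It only builds and prints the commands, it does not run them.

```python
import shlex

# Which systemd unit runs on which kind of node. The unit names come from the
# report; the role labels "server" and "agent" are this sketch's own.
SERVICE_BY_ROLE = {
    "server": "rke2-server.service",  # CP/etcd nodes
    "agent": "rke2-agent.service",    # worker nodes
}


def restart_command(host: str, role: str) -> list[str]:
    """Build the ssh command that restarts the RKE2 service on one node."""
    service = SERVICE_BY_ROLE[role]
    return ["ssh", host, f"sudo systemctl restart {service}"]


def restart_plan(nodes: dict[str, str]) -> list[list[str]]:
    """Build one restart command per node, in the order the nodes are given."""
    return [restart_command(host, role) for host, role in nodes.items()]


if __name__ == "__main__":
    # Hypothetical inventory.
    for cmd in restart_plan({"cp-1": "server", "worker-1": "agent"}):
        print(shlex.join(cmd))
```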
Note that I was not able to find any relevant error in the Rancher, RKE2 service, or pod logs. Everything seems to work just fine, with no obvious error, so I don't know how to troubleshoot this further.
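Things I've tried that didn't help with this issue:
The legacy feature flag in the Rancher WebUI. I even rebuilt the cluster and did all of this again, with the same outcome.
Reproduction steps
Install Rancher. I have Rancher v2.8.4 installed via Helm on a 3-node multi-master K3s cluster (v1.27.7+k3s2).
In Rancher, create a new cluster definition of a generic cluster, using this YAML:
Then register all cluster nodes with Rancher using the Rancher-issued commands in the WebUI. I have 3 CP/ETCD nodes and 2 worker nodes.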
The cluster itself works great, with no issues. Now remove it from Rancher, then Import the cluster, give it the same name (main), and the commands will be shown. I had to update my kubeconfig to be able to connect to RKE2 directly (without using Rancher as a proxy), but the import also works perfectly fine.
When the cluster is imported and fully "green" in the WebUI, go to cluster management and upgrade it to any higher version. It could be the latest one shown in Rancher, or any version higher than the existing one. I've seen the CP/ETCD nodes upgrade one by one, but the process gets stuck on the last CP/ETCD node and stays there with the message "being upgraded". After several minutes, the [Disconnected] Cluster agent is not connected error is shown.
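Since the Rancher proxy is unavailable in this state, the upgrade result can also be checked from the Kubernetes side with the direct kubeconfig. The sketch below assumes the JSON comes from `kubectl get nodes -o json`; the function names and the sample node names are hypothetical.

```python
import json


def kubelet_versions(nodes_json: str) -> dict[str, str]:
    """Map node name to kubelet version from `kubectl get nodes -o json` output."""
    doc = json.loads(nodes_json)
    return {
        item["metadata"]["name"]: item["status"]["nodeInfo"]["kubeletVersion"]
        for item in doc["items"]
    }


def nodes_not_at(nodes_json: str, target: str) -> list[str]:
    """Return the sorted names of nodes whose kubelet version differs from `target`."""
    versions = kubelet_versions(nodes_json)
    return sorted(name for name, version in versions.items() if version != target)


if __name__ == "__main__":
    # Hypothetical sample with one node left behind on an older version.
    sample = json.dumps({
        "items": [
            {"metadata": {"name": "cp-1"},
             "status": {"nodeInfo": {"kubeletVersion": "v1.28.9+rke2r1"}}},
            {"metadata": {"name": "worker-1"},
             "status": {"nodeInfo": {"kubeletVersion": "v1.27.13+rke2r1"}}},
        ]
    })
    print(nodes_not_at(sample, "v1.28.9+rke2r1"))
```

An empty result for the target version would support the assumption that the nodes themselves finished upgrading and that only Rancher's view of the cluster is stuck.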
Conclusion
As I've stated above, the rke2 binary is the exact same version on all nodes after the upgrade, so I assume the upgrade is successful but there is some nasty bug in Rancher, and I don't think I can do anything other than report this issue here.
Please let me know what additional information you would need from me to help with this issue.
erkexzcx added the kind/bug label (Issues that are defects reported by users or that we know have reached a real release) on May 29, 2024