Resolved - [BUG] All pools unavailable on RKE2/K3s provisioned clusters on Rancher 2.8.0-2.8.2 #44929
Comments
Same here.
+1
Looks like it's broken after the upgrade to https://github.com/rancher/charts/tree/release-v2.8/charts/rancher-provisioning-capi/103.1.0%2Bup0.1.0
Same here.
Setting the
I also tried to downgrade, but it quickly gets updated again.
We had the same issue. Suddenly, out of nowhere, it showed up across all our Rancher clusters and "downstream" clusters. After a long investigation, we came to the same conclusion (i.e., the CAPI upgrade from 1.4.4 to 1.5.5 is the cause). We weren't (and aren't) sure what the easiest way to pin it to 1.4.4 is. Every time we downgraded, it auto-upgraded itself back to the broken 1.5.5 version. What we ended up doing in the interest of speed was:
Maybe that will help someone else work around this until a proper fix has been found. If someone knows a better way to pin it to 1.4.4, please let us know :)
@daleckystepan (or anyone else seeing this issue) - could you look at the secret that contains the kubeconfig for one of the clusters and see what labels it has? I'd be interested if there is one called cluster.x-k8s.io/cluster-name.
@richardcase no labels at all on the secret
No cluster.x-k8s.io/cluster-name label, @richardcase
Thanks @nickvth & @daleckystepan. The lack of labels appears to be the issue. Just to let you know, we are looking at how to resolve this.
I added the label manually and it seems to be working.
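For anyone else applying the label workaround, here is a minimal sketch using kubectl. The fleet-default namespace and the my-cluster-kubeconfig secret name are assumptions based on a typical Rancher provisioning setup, so verify them against your environment first:

```bash
# Find the kubeconfig secret for the affected cluster (namespace is an assumption).
kubectl get secrets -n fleet-default | grep kubeconfig

# Add the label that the CAPI 1.5.x controllers expect; the secret name and
# cluster name below are placeholders.
kubectl label secret my-cluster-kubeconfig \
  cluster.x-k8s.io/cluster-name=my-cluster \
  -n fleet-default
```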
This is the most effective way I can think of right now to pin the chart version so that it does not get inadvertently upgraded.
Adding the label manually to the kubeconfig secrets is also a solution in the short term, but if Rancher deems the kubeconfig invalid (i.e., the token is no longer valid, the server-url has changed, etc.), it will recreate that kubeconfig secret sans the label.
Can confirm this is working for new cluster provisioning with AWS EC2.
This was described in the migration notes from 1.4 to 1.5: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/developer/providers/migrations/v1.4-to-v1.5.md#other
This was the PR that propagated the change to Rancher 2.8.x environments: rancher/charts#3688
Workaround: deploy Rancher with useBundledSystemChart=true. This may always be the recommended way if you don't want every merge/push to the release-v2.8 git branch to update your cluster. It configures the Rancher server to use the packaged copy of the Helm system charts. The system charts repository contains all the catalog items required for features such as monitoring, logging, alerting, and global DNS. These Helm charts are located on GitHub, but if you are in an air-gapped environment, using the charts bundled within Rancher is much easier than setting up a Git mirror. After that:
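A rough sketch of applying that setting for a Helm-based Rancher install; the rancher-stable repo name, the cattle-system namespace, and the use of --reuse-values are assumptions about your existing installation:

```bash
# Assumption: Rancher was installed via Helm from the rancher-stable repo
# into the cattle-system namespace. Adjust to match your setup.
helm upgrade rancher rancher-stable/rancher \
  --namespace cattle-system \
  --reuse-values \
  --set useBundledSystemChart=true
```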
Thanks for sharing, but 2.8.3 is not released yet. So why propagate this change?
I have the same question. We will look to
Thank you for providing a quick fix. We will use the label workaround for now and upgrade to 2.8.3 as soon as it's out.
Unfortunately, neither workaround worked for me.
I'd appreciate help on how to mitigate this.
@kingnarmer, please use the workaround from #44929 (comment), as it pins the charts repo to a commit before this problematic updated chart was released. As an overall update, we're working on releasing a new version of the chart that essentially rolls back the CAPI version upgrade, which should address the issue. This is currently undergoing the QA process, and rancher/charts#3700 is the PR to release this fixed chart.
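For reference, a very rough sketch of what pinning the built-in charts repo can look like; the ClusterRepo name rancher-charts and whether your Rancher version resolves a commit SHA placed in spec.gitBranch are assumptions here, so treat this as illustrative rather than a confirmed procedure:

```bash
# Illustrative only: edit the built-in charts repository resource and point
# it at a fixed revision instead of the moving release-v2.8 branch.
kubectl edit clusterrepos.catalog.cattle.io rancher-charts
# ...then, in the editor, set spec.gitBranch to the desired revision, e.g.:
#   spec:
#     gitRepo: https://git.rancher.io/charts
#     gitBranch: <commit-before-the-problematic-capi-chart>
```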
Same here; this workaround also worked for me.
I tried to change
Either wait for the fix, or you can manually roll back.
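If you go the manual rollback route, a sketch of what that could look like with Helm; the release name rancher-provisioning-capi and the cattle-provisioning-capi-system namespace are assumptions about how the chart is installed in your environment:

```bash
# Assumption: the chart is installed as a Helm release named
# rancher-provisioning-capi in the cattle-provisioning-capi-system namespace.
helm history rancher-provisioning-capi -n cattle-provisioning-capi-system
helm rollback rancher-provisioning-capi <revision-with-capi-1.4.x> \
  -n cattle-provisioning-capi-system
```

As several commenters note above, Rancher will likely upgrade the chart again unless the charts repo is also pinned.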
The CAPI upgrade also makes it impossible to provision new clusters, for the same reason. We verified that the workaround of setting the label on the kubeconfig secret works.
My list of installed apps is always empty; I don't know why.
Same issue.
QA validations were completed and rancher/charts#3700 has just been merged, meaning the fixed chart is now available. Any users that applied the workaround to downgrade the chart to an earlier version should be able to move to the fixed chart once their chart data refreshes.
Important note: Rancher automatically refreshes chart data every 6 hours. To force an immediate refresh, please follow these steps:
Keeping the issue open for some time to ensure the fix works for all affected users.
Can confirm this is working with provisioning AWS EC2 instances using 103.2.0+up0.0.1.
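To confirm which chart version actually ended up deployed after the refresh, a small sketch; the release naming is an assumption, so adjust the filter as needed:

```bash
# List Helm releases across all namespaces and filter for the CAPI
# provisioning chart to check its deployed chart version.
helm list -A | grep provisioning-capi
```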
Unfortunately, this did not work. I had also added the cluster name label to the kubeconfig secret under fleet-default. At the moment, I still cannot provision a custom RKE2 cluster. Is it possible I'm doing something wrong here? The rancher-provisioning-capi chart was updated to the latest version as well.
@dylanthepodman, thank you for reporting this. Most likely you're running into a different issue. Was provisioning working OK on the same setup earlier?
Your screenshot shows that your cluster is waiting for a worker node to be registered (and on top of that, your cluster does not have a worker listed in your machine list).
This is to avoid the issue seen in rancher/rancher#44929. Signed-off-by: Loic Devulder <ldevulder@suse.com>
Update 3/27/2024
The issue should no longer be reproducible, see #44929 (comment) for more information.
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
All pools are unexpectedly unavailable on more than 30 clusters with different configurations, DCs, different VMware versions, etc. On most clusters, all pods inside the local cluster were restarted during last night.
A lot of similar errors in the local cluster from capi-controller-manager-bf5847f5b-n8t89.
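To inspect those errors, a small sketch for pulling the controller logs; the cattle-provisioning-capi-system namespace is an assumption about where Rancher runs this deployment, so adjust it if yours differs:

```bash
# Assumption: the CAPI controller runs as a deployment in
# cattle-provisioning-capi-system on the Rancher (local) cluster.
kubectl logs -n cattle-provisioning-capi-system \
  deploy/capi-controller-manager --tail=100
```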
I checked the app-kubeconfig secret and it is present and seems to have valid content.
To Reproduce
I don't know
Result
All pools unavailable, cluster maintenance not possible.
Expected Result
All pools available
Screenshots
Additional context
Thank you for your support.