
Resolved - [BUG] All pools unavailable on RKE2/K3s provisioned clusters on Rancher 2.8.0-2.8.2 #44929

Closed
daleckystepan opened this issue Mar 26, 2024 · 34 comments
Labels: kind/bug, status/release-blocker, team/hostbusters

Comments

@daleckystepan commented Mar 26, 2024

Update 3/27/2024

The issue should no longer be reproducible, see #44929 (comment) for more information.


Rancher Server Setup

  • Rancher version: 2.8.1 and also 2.8.2
  • Installation option (Docker install/Helm Chart): Docker
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): v1.27.11+rke2r1
  • Proxy/Cert Details: No proxy, self-signed certs and also Let's Encrypt

Information about the Cluster

  • Kubernetes version: v1.27.11+rke2r1
  • Cluster Type (Local/Downstream): v1.27.11+rke2r1, vSphere node provider
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): Admin
    • If custom, define the set of permissions:

Describe the bug
All pools are unexpectedly unavailable on more than 30 clusters across different configurations, DCs, VMware versions, etc. On most clusters, all pods inside the local cluster were restarted during last night.

There are a lot of similar errors in the local cluster's capi-controller-manager-bf5847f5b-n8t89 pod:

2024-03-26T13:49:09.985633257Z E0326 13:49:09.984995       1 machineset_controller.go:883] "Unable to retrieve Node status" err="failed to create cluster accessor: error fetching REST client config for remote cluster \"fleet-default/app\": failed to retrieve kubeconfig secret for Cluster fleet-default/app: Secret \"app-kubeconfig\" not found" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="fleet-default/app-worker-dev-764b7759f8x5dkwx" namespace="fleet-default" name="app-worker-dev-764b7759f8x5dkwx" reconcileID=d40f0d80-9fe6-43b4-aa67-9bd9ec9c5756 MachineDeployment="fleet-default/app-worker-dev" Cluster="fleet-default/app" Machine="fleet-default/app-worker-dev-764b7759f8x5dkwx-n8wx7" node=""

I checked the app-kubeconfig secret; it is present and seems to have valid content.

To Reproduce
I don't know

Result
All pools unavailable, cluster maintenance not possible.

Expected Result
All pools available

Screenshots
Screenshot 2024-03-26 14:08:20 (attached)

Additional context

Thank you for support

@daleckystepan added the kind/bug label Mar 26, 2024
@nickvth commented Mar 26, 2024

same here

@bagutzu commented Mar 26, 2024

+1

@nickvth commented Mar 26, 2024

@daleckystepan changed the title from "[BUG] All pools unavailable since 3:00 GMT on all clusters" to "[BUG] All pools unavailable on all clusters" Mar 26, 2024
@fplantinga-guida

Same here

@hansbogert

Setting the capi-controller-manager to version 1.4.4 (it was 1.5.5) lets me deploy new clusters with the correct health status. Deploying new clusters had not been possible for roughly 10 hours.

@daleckystepan (Author)

I also tried to downgrade, but it quickly gets updated again.

@m4rCsi commented Mar 26, 2024

We had the same issue. Suddenly, out of nowhere, across all our rancher clusters and "downstream" clusters, we saw the same issue.

After a long investigation, we came to the same conclusion (i.e., the capi upgrade from 1.4.4 to 1.5.5 is the cause).

We weren't (and aren't) sure what the easiest way to pin it to 1.4.4 is. Every time we downgraded, it auto-upgraded itself back to the broken 1.5.5 version. What we ended up doing in the interest of speed was:

  • Going into the Apps → Repositories section
  • Changing the repository named "Rancher" from branch "release-v2.8" to commit a318ef65fddf66b44c468d4a2636930ef39a88fd
  • Going to Installed Apps
  • Downgrading rancher-provisioning-capi (from 103.1.0+up0.1.0 to 103.0.0+up0.0.1)

Maybe that will help someone else to work around this until a proper fix has been found.

If someone knows how to pin it to 1.4.4 in a better way, please let us know :)
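
For reference, the same repo pin can probably be done from the CLI against the local cluster. A minimal, untested sketch (assuming the UI's "Rancher" repository corresponds to the rancher-charts ClusterRepo object; verify the name first):

# list the chart repositories Rancher manages
kubectl get clusterrepos.catalog.cattle.io

# point the repo at the pre-regression commit instead of the release-v2.8 branch
kubectl patch clusterrepos.catalog.cattle.io rancher-charts --type merge \
  -p '{"spec":{"gitBranch":"a318ef65fddf66b44c468d4a2636930ef39a88fd"}}'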

@richardcase (Contributor)

@daleckystepan (or anyone else seeing this issue) - could you look at the secret that contains the kubeconfig for one of the clusters and see what labels it has? I'd be interested if there is one called cluster.x-k8s.io/cluster-name.

@daleckystepan (Author) commented Mar 26, 2024

@richardcase no labels at all on secret app-kubeconfig

@nickvth commented Mar 26, 2024

No cluster.x-k8s.io/cluster-name label, @richardcase:

kg secret donald-prod-1-kubeconfig  -o yaml
apiVersion: v1
data:
  token: *****
  value: *****
kind: Secret
metadata:
  creationTimestamp: "2024-02-02T09:42:29Z"
  name: donald-prod-1-kubeconfig
  namespace: fleet-default
  ownerReferences:
  - apiVersion: provisioning.cattle.io/v1
    kind: Cluster
    name: donald-prod-1
    uid: a9e0e6f8-ba10-4a5d-9ed5-ded4077631dd
  resourceVersion: "142716571"
  uid: 5aacaca0-1ea7-4849-a055-cced99c75d4c
type: Opaque

@richardcase (Contributor)

Thanks @nickvth and @daleckystepan. The lack of labels appears to be the issue. Just to let you know, we are looking at how to resolve this.

@daleckystepan (Author)

I added the label manually and it seems to be working.
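
In case it helps others, a minimal sketch of what I mean (assuming a cluster named app in the fleet-default namespace, matching the log above):

# check whether the kubeconfig secret carries the expected label
kubectl -n fleet-default get secret app-kubeconfig --show-labels

# add the label that CAPI 1.5.x expects on kubeconfig secrets
kubectl -n fleet-default label secret app-kubeconfig cluster.x-k8s.io/cluster-name=app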

@Oats87 (Contributor) commented Mar 26, 2024

(quoting @m4rCsi's workaround above: pin the "Rancher" charts repository to commit a318ef65fddf66b44c468d4a2636930ef39a88fd and downgrade rancher-provisioning-capi to 103.0.0+up0.0.1)

This is the most effective way I can think of, as of now, to pin the version of the chart so that it does not get inadvertently upgraded.

(quoting @daleckystepan's comment above about adding the label manually)

Adding the label manually to the kubeconfig secrets is also a short-term solution, but if Rancher deems the kubeconfig invalid (i.e. the token is no longer valid, the server-url has changed, etc.), it will recreate that kubeconfig secret without the label.

@josh383451

(quoting @m4rCsi's workaround above)

Can confirm this is working for new cluster provisioning with AWS EC2

@zackbradys commented Mar 26, 2024

I can confirm that the above fix worked for existing clusters and provisioning new clusters.

Additionally, an alternative fix would be to redeploy Rancher with useBundledSystemChart: true, which will redeploy the capi-controller-manager and any other related resources. I haven't tried manually labeling the cluster, but others stated earlier that it worked as well.
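
For Helm-based installs, a rough sketch of that redeploy (assuming Rancher was installed as the rancher release in cattle-system from the rancher-stable repo; adjust names and version to your setup):

helm upgrade rancher rancher-stable/rancher \
  --namespace cattle-system \
  --reuse-values \
  --set useBundledSystemChart=true

# For Docker-based installs, the equivalent appears to be the CATTLE_SYSTEM_CATALOG=bundled
# environment variable, passed with -e when (re)creating the rancher/rancher container.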

rancher-cluster-screenshot

@atsai1220

This was described in the migration notes from 1.4 to 1.5: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/developer/providers/migrations/v1.4-to-v1.5.md#other

The generated kubeconfig by the Control Plane providers must be labelled with the key-value pair cluster.x-k8s.io/cluster-name=${CLUSTER_NAME}. This is required for the CAPI managers caches to store and retrieve them for the required operations.

This is the PR that propagated the change to Rancher 2.8.x environments: rancher/charts#3688
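
To make the requirement concrete, a hypothetical example of how a kubeconfig secret would look once labelled (cluster named app in fleet-default; values illustrative only):

apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: app-kubeconfig
  namespace: fleet-default
  labels:
    cluster.x-k8s.io/cluster-name: app   # the label that CAPI >= 1.5 expects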

@nickvth commented Mar 26, 2024

Workaround: deploy Rancher with useBundledSystemChart=true. This may always be the recommended approach if you don't want every merge/push to the release-v2.8 git branch to update your cluster.

Configure Rancher server to use the packaged copy of Helm system charts. The system charts repository contains all the catalog items required for features such as monitoring, logging, alerting and global DNS. These Helm charts are located in GitHub, but since you are in an air gapped environment, using the charts that are bundled within Rancher is much easier than setting up a Git mirror.

After that:

  • Going to Installed Apps
  • Downgrading rancher-provisioning-capi (from 103.1.0+up0.1.0 to 103.0.0+up0.0.1)
  • No new version available

@nickvth commented Mar 26, 2024

(quoting @atsai1220's comment above about the CAPI v1.4→v1.5 migration notes and rancher/charts#3688)

Thanks for sharing, but 2.8.3 is not released yet, so why was this change propagated?

@atsai1220

(quoting the exchange above about rancher/charts#3688 and the question of why the change was propagated before 2.8.3 was released)

I have the same question. We will look into useBundledSystemChart=true in the future to prevent surprises.

@pvlkov commented Mar 26, 2024

Thank you for providing a quick fix. We will use the label workaround for now and upgrade to 2.8.3 as soon as it's out.

@kingnarmer commented Mar 26, 2024

Unfortunately, neither workaround worked for me.

  • I downgraded rancher-provisioning-capi from 103.1.0+up0.1.0 to 103.0.0+up0.0.1 via the Rancher GUI → Apps → Installed Apps. It was fine for a few minutes, then the new version came back.

  • Setting useBundledSystemChart=true on the existing Rancher install had no effect.

I'd appreciate help on how to mitigate this.

@snasovich (Collaborator)

@kingnarmer, please use the workaround from #44929 (comment), as it pins the charts repo to a commit before the problematic updated chart was released.

As an overall update, we're working on releasing a new version of the chart that essentially rolls back the CAPI version upgrade, which should address the issue. This is currently undergoing the QA process, and rancher/charts#3700 is the PR to release the fixed chart.

@snasovich added the status/release-blocker and team/hostbusters labels Mar 26, 2024
@sulaimantok commented Mar 27, 2024

Same here; this workaround also worked for me.

@daleckystepan (Author) commented Mar 27, 2024

I tried to change CATTLE_SYSTEM_CATALOG to bundled, but it probably has no effect if Rancher is already installed. Is there any other way to prevent these online updates and make them more transparent and manageable for us?

@avthart commented Mar 27, 2024

(quoting @daleckystepan's question above about CATTLE_SYSTEM_CATALOG)

Either wait for the fix, or manually roll back rancher-provisioning-capi using Rancher Apps.

@qhris commented Mar 27, 2024

The CAPI upgrade also makes it impossible to provision new clusters, for the same reason.

We verified that the workaround of setting the label on the kubeconfig secret works. We also tested Rancher 2.8.3-rc6, which works as well.

@Denys-Janrain-L

My list of installed apps is always empty; I don't know why.
So only the first two steps of the workaround worked for me. After that, I had to open the Rancher local cluster console (through the browser) and run:
helm -n cattle-provisioning-capi-system rollback rancher-provisioning-capi 1
which rolled it back to 103.0.0+up0.0.1

@romarioschneider

same issue

@snasovich (Collaborator) commented Mar 27, 2024

QA validations have been completed and rancher/charts#3700 has just been merged, meaning rancher-provisioning-capi version 103.2.0+up0.0.1 is now released. All default-configured, non-airgap v2.8.0-v2.8.2 Rancher deployments will automatically update to this fixed version of the chart, and the issue should be resolved.

Any users who applied the workaround to downgrade the chart to 103.0.0+up0.0.1 (e.g. by following the instructions in #44929 (comment)) should now be free to roll back the workaround.

Important Note: Rancher automatically refreshes chart data every 6 hours. To force immediate refresh, please follow these steps:

  1. Select local cluster
  2. Open "Apps" -> "Repositories"
  3. Locate and check Rancher from the list of displayed repos
  4. Select "Refresh" and wait for repo status to update to "Active"
  5. rancher-provisioning-capi will upgrade to version 103.2.0+up0.0.1 shortly

Keeping the issue open for some time to ensure the fix works for all affected users.

@josh383451 commented Mar 27, 2024

(quoting @snasovich's announcement above that rancher-provisioning-capi 103.2.0+up0.0.1 has been released)

I can confirm this is working when provisioning AWS EC2 instances using 103.2.0+up0.0.1.
(screenshot attached)

@snasovich changed the title from "[BUG] All pools unavailable on all clusters" to "Resolved - [BUG] All pools unavailable on all clusters" Mar 27, 2024
@snasovich snasovich pinned this issue Mar 27, 2024
@dylanthepodman commented Mar 27, 2024

Unfortunately, this did not work. I had also added the cluster name label to the kubeconfig secret under fleet-default. At the moment, I still cannot provision a custom RKE2 cluster. Is it possible I'm doing something wrong here? The rancher-provisioning-capi chart was updated to the latest version as well.

(screenshot attached)

@snasovich changed the title from "Resolved - [BUG] All pools unavailable on all clusters" to "Resolved - [BUG] All pools unavailable on RKE2/K3s provisioned clusters on Rancher 2.8.0-2.8.2" Mar 27, 2024
@snasovich (Collaborator)

@dylanthepodman, thank you for reporting this. Most likely you're running into a different issue. Was provisioning working OK on the same setup earlier?

@Oats87 (Contributor) commented Mar 27, 2024

(quoting @dylanthepodman's report and screenshot above)

Your screenshot shows that your cluster is waiting for a worker node to be registered (and, on top of that, your cluster does not have a worker listed in its machine list).

@dylanthepodman

(quoting the exchange above about the cluster waiting for a worker node)

I tried, and it is working now. Thank you for mentioning this to me.

Interestingly enough, I tried this same setup in Rancher v2.6.12 and it did not have this problem.

ldevulder added a commit to ldevulder/ele-testhelpers that referenced this issue Mar 28, 2024
This is to avoid the issue seen in rancher/rancher#44929.

Signed-off-by: Loic Devulder <ldevulder@suse.com>
ldevulder added a commit to rancher-sandbox/ele-testhelpers that referenced this issue Mar 28, 2024
This is to avoid the issue seen in rancher/rancher#44929.

Signed-off-by: Loic Devulder <ldevulder@suse.com>
@Priyashetty17 Priyashetty17 unpinned this issue Mar 28, 2024
@pgonin pgonin pinned this issue Mar 29, 2024
@markusewalker markusewalker unpinned this issue Apr 3, 2024
@snasovich snasovich pinned this issue Apr 3, 2024
@snasovich snasovich unpinned this issue Apr 9, 2024