
Resolved - [BUG] All pools unavailable on RKE2/K3s provisioned clusters on Rancher 2.8.0-2.8.2 #44929

Closed
daleckystepan opened this issue Mar 26, 2024 · 34 comments
Labels: kind/bug, status/release-blocker, team/hostbusters

Comments

@daleckystepan commented Mar 26, 2024

Update 3/27/2024

The issue should no longer be reproducible, see #44929 (comment) for more information.


Rancher Server Setup

  • Rancher version: 2.8.1 and also 2.8.2
  • Installation option (Docker install/Helm Chart): Docker
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): v1.27.11+rke2r1
  • Proxy/Cert Details: No proxy, self-signed certs and also Let's Encrypt

Information about the Cluster

  • Kubernetes version: v1.27.11+rke2r1
  • Cluster Type (Local/Downstream): v1.27.11+rke2r1, vSphere node provider
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): Admin
    • If custom, define the set of permissions:

Describe the bug
All pools are unexpectedly unavailable on more than 30 clusters across different configurations, DCs, VMware versions, etc. On most clusters, all pods inside the local cluster were restarted during last night.

There are a lot of similar errors in the local cluster's capi-controller-manager-bf5847f5b-n8t89 pod:

2024-03-26T13:49:09.985633257Z E0326 13:49:09.984995       1 machineset_controller.go:883] "Unable to retrieve Node status" err="failed to create cluster accessor: error fetching REST client config for remote cluster \"fleet-default/app\": failed to retrieve kubeconfig secret for Cluster fleet-default/app: Secret \"app-kubeconfig\" not found" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="fleet-default/app-worker-dev-764b7759f8x5dkwx" namespace="fleet-default" name="app-worker-dev-764b7759f8x5dkwx" reconcileID=d40f0d80-9fe6-43b4-aa67-9bd9ec9c5756 MachineDeployment="fleet-default/app-worker-dev" Cluster="fleet-default/app" Machine="fleet-default/app-worker-dev-764b7759f8x5dkwx-n8wx7" node=""

I checked the app-kubeconfig secret; it is present and seems to have valid content.

To Reproduce
I don't know

Result
All pools unavailable, cluster maintenance not possible.

Expected Result
All pools available

Screenshots
Screenshot 2024-03-26 14:08:20 (attached)

Additional context

Thank you for support

@daleckystepan added the kind/bug label Mar 26, 2024
@nickvth commented Mar 26, 2024

same here

@bagutzu commented Mar 26, 2024

+1

@nickvth commented Mar 26, 2024

@daleckystepan changed the title from "[BUG] All pools unavailable since 3:00 GMT on all clusters" to "[BUG] All pools unavailable on all clusters" Mar 26, 2024
@fplantinga-guida

Same here

@hansbogert

Setting the capi-controller-manager to version 1.4.4 (it was 1.5.5) lets me deploy new clusters with the correct health status. Deploying new clusters had not been possible for roughly 10 hours.

@daleckystepan (Author)

I also tried to downgrade, but it quickly gets updated again.

@m4rCsi commented Mar 26, 2024

We had the same issue. Suddenly, out of nowhere, across all our rancher clusters and "downstream" clusters, we saw the same issue.

After a long investigation, we came to the same conclusion (i.e., the capi upgrade from 1.4.4 to 1.5.5 is the cause).

We weren't (and aren't) sure what the easiest way to pin it to 1.4.4 is. Every time we downgraded, it auto-upgraded itself back to the broken 1.5.5 version. What we ended up doing in the interest of speed was:

  • Going into the Apps → Repositories section
  • Changing the repository named "Rancher" from branch "release-v2.8" to commit a318ef65fddf66b44c468d4a2636930ef39a88fd
  • Going to Installed Apps
  • Downgrading rancher-provisioning-capi (from 103.1.0+up0.1.0 to 103.0.0+up0.0.1)

Maybe that will help someone else to work around this until a proper fix has been found.

If someone knows how to pin it to 1.4.4 in a better way, please let us know :)
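
For reference, the same repo pin can probably be done from the CLI against the local cluster. A minimal, untested sketch (assuming the UI's "Rancher" repository corresponds to the rancher-charts ClusterRepo object; verify the name first):

# list the chart repositories Rancher manages
kubectl get clusterrepos.catalog.cattle.io

# point the repo at the pre-regression commit instead of the release-v2.8 branch
kubectl patch clusterrepos.catalog.cattle.io rancher-charts --type merge \
  -p '{"spec":{"gitBranch":"a318ef65fddf66b44c468d4a2636930ef39a88fd"}}'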

@richardcase (Contributor)

@daleckystepan (or anyone else seeing this issue) - could you look at the secret that contains the kubeconfig for one of the clusters and see what labels it has? I'd be interested if there is one called cluster.x-k8s.io/cluster-name.

@daleckystepan (Author) commented Mar 26, 2024

@richardcase no labels at all on secret app-kubeconfig

@nickvth commented Mar 26, 2024

No cluster.x-k8s.io/cluster-name label, @richardcase:

kg secret donald-prod-1-kubeconfig  -o yaml
apiVersion: v1
data:
  token: *****
  value: *****
kind: Secret
metadata:
  creationTimestamp: "2024-02-02T09:42:29Z"
  name: donald-prod-1-kubeconfig
  namespace: fleet-default
  ownerReferences:
  - apiVersion: provisioning.cattle.io/v1
    kind: Cluster
    name: donald-prod-1
    uid: a9e0e6f8-ba10-4a5d-9ed5-ded4077631dd
  resourceVersion: "142716571"
  uid: 5aacaca0-1ea7-4849-a055-cced99c75d4c
type: Opaque

@richardcase (Contributor)

Thanks @nickvth and @daleckystepan. The lack of labels appears to be the issue. Just to let you know, we are looking at how to resolve this.

@daleckystepan (Author)

I added the label manually and it seems to be working.
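
In case it helps others, a minimal sketch of what I mean (assuming a cluster named app in the fleet-default namespace, matching the log above):

# check whether the kubeconfig secret carries the expected label
kubectl -n fleet-default get secret app-kubeconfig --show-labels

# add the label that CAPI 1.5.x expects on kubeconfig secrets
kubectl -n fleet-default label secret app-kubeconfig cluster.x-k8s.io/cluster-name=app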

@Oats87 (Contributor) commented Mar 26, 2024

(quoting @m4rCsi's workaround above: pin the "Rancher" charts repository to commit a318ef65fddf66b44c468d4a2636930ef39a88fd and downgrade rancher-provisioning-capi to 103.0.0+up0.0.1)

This is the most effective way I can think of, as of now, to pin the version of the chart so that it does not get inadvertently upgraded.

(quoting @daleckystepan's comment above about adding the label manually)

Adding the label manually to the kubeconfig secrets is also a short-term solution, but if Rancher deems the kubeconfig invalid (i.e. the token is no longer valid, the server-url has changed, etc.), it will recreate that kubeconfig secret without the label.

@josh383451

(quoting @m4rCsi's workaround above)

Can confirm this is working for new cluster provisioning with AWS EC2

@zackbradys commented Mar 26, 2024

I can confirm that the above fix worked for existing clusters and provisioning new clusters.

Additionally, an alternative fix would be to redeploy Rancher with useBundledSystemChart: true, which will redeploy the capi-controller-manager and any other related resources. I haven't tried manually labeling the cluster, but others stated earlier that it worked as well.
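
For Helm-based installs, a rough sketch of that redeploy (assuming Rancher was installed as the rancher release in cattle-system from the rancher-stable repo; adjust names and version to your setup):

helm upgrade rancher rancher-stable/rancher \
  --namespace cattle-system \
  --reuse-values \
  --set useBundledSystemChart=true

# For Docker-based installs, the equivalent appears to be the CATTLE_SYSTEM_CATALOG=bundled
# environment variable, passed with -e when (re)creating the rancher/rancher container.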

rancher-cluster-screenshot

@atsai1220

This was described in the migration notes from 1.4 to 1.5: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/developer/providers/migrations/v1.4-to-v1.5.md#other

The generated kubeconfig by the Control Plane providers must be labelled with the key-value pair cluster.x-k8s.io/cluster-name=${CLUSTER_NAME}. This is required for the CAPI managers caches to store and retrieve them for the required operations.

This is the PR that propagated the change to Rancher 2.8.x environments: rancher/charts#3688
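
To make the requirement concrete, a hypothetical example of how a kubeconfig secret would look once labelled (cluster named app in fleet-default; values illustrative only):

apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: app-kubeconfig
  namespace: fleet-default
  labels:
    cluster.x-k8s.io/cluster-name: app   # the label that CAPI >= 1.5 expects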

@nickvth commented Mar 26, 2024

Workaround: deploy Rancher with useBundledSystemChart=true. This may always be the recommended approach if you don't want every merge/push to the release-v2.8 git branch to update your cluster.

Configure Rancher server to use the packaged copy of Helm system charts. The system charts repository contains all the catalog items required for features such as monitoring, logging, alerting and global DNS. These Helm charts are located in GitHub, but since you are in an air gapped environment, using the charts that are bundled within Rancher is much easier than setting up a Git mirror.

After that:

  • Going to Installed Apps
  • Downgrading rancher-provisioning-capi (from 103.1.0+up0.1.0 to 103.0.0+up0.0.1)
  • No new version available

@nickvth commented Mar 26, 2024

(quoting @atsai1220's comment above about the CAPI v1.4→v1.5 migration notes and rancher/charts#3688)

Thanks for sharing, but 2.8.3 is not released yet, so why was this change propagated?

@atsai1220

(quoting the exchange above about rancher/charts#3688 and the question of why the change was propagated before 2.8.3 was released)

I have the same question. We will look into useBundledSystemChart=true in the future to prevent surprises.

@pvlkov commented Mar 26, 2024

Thank you for providing a quick fix. We will use the label workaround for now and upgrade to 2.8.3 as soon as it's out.

@kingnarmer commented Mar 26, 2024

Unfortunately, neither workaround worked for me.

  • I downgraded rancher-provisioning-capi from 103.1.0+up0.1.0 to 103.0.0+up0.0.1 via the Rancher GUI → Apps → Installed Apps. It was fine for a few minutes, then the new version came back.

  • Setting useBundledSystemChart=true on the existing Rancher install had no effect.

I'd appreciate help on how to mitigate this.

@snasovich (Collaborator)

@kingnarmer, please use the workaround from #44929 (comment), as it pins the charts repo to a commit before the problematic updated chart was released.

As an overall update, we're working on releasing a new version of the chart that essentially rolls back the CAPI version upgrade, which should address the issue. This is currently undergoing the QA process, and rancher/charts#3700 is the PR to release the fixed chart.

@snasovich added the status/release-blocker and team/hostbusters labels Mar 26, 2024
@sulaimantok commented Mar 27, 2024

Same here; this workaround also worked for me.

@daleckystepan (Author) commented Mar 27, 2024

I tried to change CATTLE_SYSTEM_CATALOG to bundled, but it probably has no effect if Rancher is already installed. Is there any other way to prevent these online updates and make them more transparent and manageable for us?

@avthart commented Mar 27, 2024

(quoting @daleckystepan's question above about CATTLE_SYSTEM_CATALOG)

Either wait for the fix, or manually roll back rancher-provisioning-capi using Rancher Apps.

@qhris commented Mar 27, 2024

The CAPI upgrade also makes it impossible to provision new clusters, for the same reason.

We verified that the workaround of setting the label on the kubeconfig secret works. We also tested Rancher 2.8.3-rc6, which works as well.

@Denys-Janrain-L

My list of installed apps is always empty; I don't know why.
So only the first two steps of the workaround worked for me. After that, I had to open the Rancher local cluster console (through the browser) and run:
helm -n cattle-provisioning-capi-system rollback rancher-provisioning-capi 1
which rolled it back to 103.0.0+up0.0.1

@romarioschneider

same issue

@snasovich (Collaborator) commented Mar 27, 2024

QA validations have been completed and rancher/charts#3700 has just been merged, meaning rancher-provisioning-capi version 103.2.0+up0.0.1 is now released. All default-configured, non-airgap v2.8.0-v2.8.2 Rancher deployments will automatically update to this fixed version of the chart, and the issue should be resolved.

Any users who applied the workaround to downgrade the chart to 103.0.0+up0.0.1 (e.g. by following the instructions in #44929 (comment)) should now be free to roll back the workaround.

Important Note: Rancher automatically refreshes chart data every 6 hours. To force immediate refresh, please follow these steps:

  1. Select local cluster
  2. Open "Apps" -> "Repositories"
  3. Locate and check Rancher from the list of displayed repos
  4. Select "Refresh" and wait for repo status to update to "Active"
  5. rancher-provisioning-capi will upgrade to version 103.2.0+up0.0.1 shortly

Keeping the issue open for some time to ensure the fix works for all affected users.

@josh383451 commented Mar 27, 2024

(quoting @snasovich's announcement above that rancher-provisioning-capi 103.2.0+up0.0.1 has been released)

I can confirm this is working when provisioning AWS EC2 instances using 103.2.0+up0.0.1.
(screenshot attached)

@snasovich changed the title from "[BUG] All pools unavailable on all clusters" to "Resolved - [BUG] All pools unavailable on all clusters" Mar 27, 2024
@snasovich snasovich pinned this issue Mar 27, 2024
@dylanthepodman commented Mar 27, 2024

Unfortunately, this did not work. I had also added the cluster name label to the kubeconfig secret under fleet-default. At the moment, I still cannot provision a custom RKE2 cluster. Is it possible I'm doing something wrong here? The rancher-provisioning-capi chart was updated to the latest version as well.

(screenshot attached)

@snasovich changed the title from "Resolved - [BUG] All pools unavailable on all clusters" to "Resolved - [BUG] All pools unavailable on RKE2/K3s provisioned clusters on Rancher 2.8.0-2.8.2" Mar 27, 2024
@snasovich (Collaborator)

@dylanthepodman, thank you for reporting this. Most likely you're running into a different issue. Was provisioning working OK on the same setup earlier?

@Oats87 (Contributor) commented Mar 27, 2024

(quoting @dylanthepodman's report and screenshot above)

Your screenshot shows that your cluster is waiting for a worker node to be registered (and, on top of that, your cluster does not have a worker listed in its machine list).

@dylanthepodman

(quoting the exchange above about the cluster waiting for a worker node)

I tried, and it is working now. Thank you for mentioning this to me.

Interestingly enough, I tried this same setup in Rancher v2.6.12 and it did not have this problem.

ldevulder added a commit to ldevulder/ele-testhelpers that referenced this issue Mar 28, 2024
This is to avoid the issue seen in rancher/rancher#44929.

Signed-off-by: Loic Devulder <ldevulder@suse.com>
ldevulder added a commit to rancher-sandbox/ele-testhelpers that referenced this issue Mar 28, 2024
This is to avoid the issue seen in rancher/rancher#44929.

Signed-off-by: Loic Devulder <ldevulder@suse.com>
@Priyashetty17 Priyashetty17 unpinned this issue Mar 28, 2024
@pgonin pgonin pinned this issue Mar 29, 2024
@markusewalker markusewalker unpinned this issue Apr 3, 2024
@snasovich snasovich pinned this issue Apr 3, 2024
@snasovich snasovich unpinned this issue Apr 9, 2024