
[BUG][CAPR] rancher-provisioning-capi-patch-sa job failing due to lack of exclusion from PSA enforcement #42719

Closed
thaneunsoo opened this issue Sep 8, 2023 · 7 comments

@thaneunsoo
Contributor

Rancher Server Setup

  • Rancher version: v2.7-head 787c056
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):

Information about the Cluster

  • Kubernetes version: v1.26.7+rke2r1
  • Cluster Type (Local/Downstream):
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):
      AWS node driver

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions: Admin

Describe the bug

The downstream cluster is unable to provision and is stuck at waiting for viable init node. I don't see the machines getting created in the AWS console, and the latest log message is Creating server [fleet-default/auto-aws-kkswl-pool0-01d54a4c-4kr6x] of kind (Amazonec2Machine) for machine auto-aws-kkswl-pool0-74f9b78f74xcl6q9-fjnch in infrastructure provider

To Reproduce

  1. Provision RKE2 AWS node driver cluster

Result
The cluster gets stuck while provisioning with the status waiting for viable init node.
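
A quick way to check whether the PodSecurity admission controller is rejecting the patch job's pods (a sketch only; the namespace is assumed to be the CAPI chart's release namespace discussed later in this thread):

```
# A PSA rejection shows up as a FailedCreate event on the job with a
# "violates PodSecurity" message.
kubectl -n cattle-provisioning-capi-system describe job rancher-provisioning-capi-patch-sa
kubectl -n cattle-provisioning-capi-system get events --sort-by=.lastTimestamp
```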

Expected Result

The cluster provisions successfully.

Screenshots

[screenshot]
@thaneunsoo added the kind/bug and status/release-blocker labels on Sep 8, 2023
@thaneunsoo added this to the 2024-Q1-v2.7x milestone on Sep 8, 2023
@slickwarren reopened this on Sep 9, 2023
@slickwarren
Contributor

This is not resolved for us.
We looked into our automation and nothing looks out of the ordinary. Here's what we know:

  • this only fails when the local cluster is RKE1 (RKE2 on the same 1.26.7 Kubernetes version deploys just fine)
  • if you manually set the rancher-provisioning-capi-patch-sa job's namespace and the cattle-fleet-local-system namespace to enforce PSA at the privileged level, both the rancher-provisioning-capi-patch-sa job and the fleet-agent job succeed (see the sketch after this list)
  • without doing this, the cluster never recovers and is basically unusable
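
A minimal sketch of that manual workaround with kubectl, assuming the patch job's namespace is cattle-provisioning-capi-system (the CAPI chart's release namespace per the comment below):

```
# Set the PSA enforce label to "privileged" on both namespaces so their
# jobs' pods are no longer rejected by the PodSecurity admission controller.
kubectl label namespace cattle-provisioning-capi-system \
  pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl label namespace cattle-fleet-local-system \
  pod-security.kubernetes.io/enforce=privileged --overwrite
```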

@slickwarren added the [zube]: To Triage, team/hostbusters, and kind/bug-qa labels and removed the kind/bug label on Sep 9, 2023
@slickwarren
Contributor

slickwarren commented Sep 9, 2023

Tested versions (all using RKE v1.4.8):
Tim: v2.7.7-rc4
Caleb: v2.7-head and v2.8-head

@aiyengar2
Contributor

aiyengar2 commented Sep 11, 2023

I would suspect that the issue here is that the changes made to introduce the new CAPI chart were not coordinated with updates to the FeatureAppNS list, which should track all feature chart / system namespaces so that PSA enforcement is excluded from them.

cattle-fleet-local-system and cattle-provisioning-capi-system (the CAPI provisioning namespace that is set as the chart's release namespace) do not appear to be in this list, so the fix should simply be to add those namespaces to the list.
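
For illustration, the fix would look roughly like the snippet below, assuming FeatureAppNS is a plain []string in Rancher's Go source (the surrounding entries and file layout are made up; only the two appended namespaces come from this issue):

```go
package main

import "fmt"

// FeatureAppNS stands in for Rancher's list of feature chart / system
// namespaces that are excluded from PSA enforcement.
var FeatureAppNS = []string{
	"cattle-system", // illustrative existing entry
	// ...other existing feature chart / system namespaces...
	"cattle-fleet-local-system",       // missing: fleet-agent job namespace
	"cattle-provisioning-capi-system", // missing: CAPI chart release namespace
}

func main() {
	fmt.Println(FeatureAppNS)
}
```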

@Oats87 changed the title from [BUG][RKE2] Downstream cluster is stuck on waiting for viable init node to [BUG][CAPR] rancher-provisioning-capi-patch-sa job failing due to lack of exclusion from PSA enforcement on Sep 11, 2023
@Oats87 self-assigned this on Sep 11, 2023
@zube bot removed the [zube]: To Triage label on Sep 11, 2023
@slickwarren
Contributor

slickwarren commented Sep 11, 2023

Using these namespaces in the pod security configuration, I was able to resolve the issue for Fleet. It appears that cattle-provisioning-capi-system is missing from this list, though.
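
For reference, a sketch of what that pod security configuration looks like as a stock PodSecurity AdmissionConfiguration; the defaults and the other exempted namespaces are illustrative, the point is the exemptions.namespaces entries:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"    # cluster-wide default level
      enforce-version: "latest"
    exemptions:
      namespaces:              # excluded from PSA enforcement
      - cattle-fleet-local-system
      - cattle-provisioning-capi-system
```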

@snasovich
Collaborator

/forwardport v2.8.0

@thaneunsoo
Contributor Author

Test Environment:

Rancher version: v2.7-head 1d25044
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: RKE2 node driver cluster


Testing:

Tested this issue with the following steps:

  1. Provision RKE2 AWS node driver cluster

Result
The job no longer fails but the cluster still isn't able to come up.
[screenshot]

@Oats87 Should I close this ticket now that the job isn't failing and open a different issue? The cluster moved to deleting after about 5 minutes and is now just stuck in that state.
[screenshots]

@thaneunsoo
Contributor Author

My bad @Oats87, this was an issue with our Jenkins job, which was fixed in #42743.

Closing this issue as fixed.
