
[SURE-7060] during rancher upgrade to 2.7.6 helm operation error with fleet-cleanup-clusterregistrations #1884

Closed
ebugit opened this issue Sep 7, 2023 · 7 comments

Comments


ebugit commented Sep 7, 2023

Rancher Server Setup

  • Rancher version: 2.7.6
  • Installation option (Docker install/Helm Chart): Helm Chart
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2

Information about the Cluster

  • Kubernetes version: v1.24.16
  • Cluster Type (Local/Downstream): Local

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • Admin/Cluster

Describe the bug

While upgrading Rancher from 2.7.5 to 2.7.6 using the following Helm command:

helm upgrade rancher rancher-2.7.6.tgz --namespace cattle-system -f  values.yaml --version=2.7.6 --no-hooks

The command completes with:
Release "rancher" has been upgraded. Happy Helming!

The Rancher pods restart correctly with the new version, but we found helm-operation pods in an error state:

$ kubectl get pod
NAME                                         READY   STATUS      RESTARTS        AGE
helm-operation-6c9jj                         1/2     Error       0               65s
helm-operation-drq8f                         1/2     Error       0               55s
helm-operation-fdqht                         0/2     Completed   0               2m24s
helm-operation-gfq2q                         1/2     Error       0               70s
helm-operation-ghhl9                         1/2     Error       0               50s
helm-operation-gxf5m                         1/2     Error       0               40s
helm-operation-mqvmk                         1/2     Error       0               45s
helm-operation-mrxzs                         2/2     Running     0               2m12s
helm-operation-pv9bw                         1/2     Error       0               60s
rancher-5cd58895bb-mjp8n                     1/1     Running     0               3m40s
rancher-5cd58895bb-v4qtd                     1/1     Running     0               4m41s
rancher-5cd58895bb-vzgm8                     1/1     Running     1 (3m56s ago)   4m41s
rancher-webhook-7bc56f7f64-c9wx9             1/1     Running     0               30d
system-upgrade-controller-5b6457d644-fdpbx   1/1     Running     8 (33d ago)     221d

Checking the logs of the failing pods shows:

kubectl logs helm-operation-mqvmk helm
helm upgrade --force-adopt=true --history-max=5 --install=true --namespace=cattle-fleet-system --reset-values=true --timeout=5m0s --values=/home/shell/helm/values-fleet-102.1.1-up0.7.1.yaml --version=102.1.1+up0.7.1 --wait=true fleet /home/shell/helm/fleet-102.1.1-up0.7.1.tgz
checking 14 resources for changes
Looks like there are no changes for ServiceAccount "gitjob"
Looks like there are no changes for ServiceAccount "fleet-controller"
Patch ConfigMap "fleet-controller" in namespace cattle-fleet-system
Looks like there are no changes for ClusterRole "gitjob"
Looks like there are no changes for ClusterRole "fleet-controller"
Looks like there are no changes for ClusterRoleBinding "gitjob-binding"
Looks like there are no changes for ClusterRoleBinding "fleet-controller"
Created a new Role called "gitjob" in cattle-fleet-system

Patch Role "fleet-controller" in namespace cattle-fleet-system
Created a new RoleBinding called "gitjob" in cattle-fleet-system

Looks like there are no changes for RoleBinding "fleet-controller"
Looks like there are no changes for Service "gitjob"
Patch Deployment "gitjob" in namespace cattle-fleet-system
Patch Deployment "fleet-controller" in namespace cattle-fleet-system
Deleting ServiceAccount "fleet-controller-bootstrap" in namespace cattle-fleet-system...
Deleting ClusterRole "fleet-controller-bootstrap" in namespace ...
Deleting ClusterRoleBinding "fleet-controller-bootstrap" in namespace ...
beginning wait for 14 resources with timeout of 5m0s
Deployment is not ready: cattle-fleet-system/gitjob. 0 out of 1 expected pods are ready
Deployment is not ready: cattle-fleet-system/gitjob. 0 out of 1 expected pods are ready
Deployment is not ready: cattle-fleet-system/gitjob. 0 out of 1 expected pods are ready
Deployment is not ready: cattle-fleet-system/gitjob. 0 out of 1 expected pods are ready
Starting delete for "fleet-cleanup-clusterregistrations" Job
Ignoring delete failure for "fleet-cleanup-clusterregistrations" batch/v1, Kind=Job: jobs.batch "fleet-cleanup-clusterregistrations" not found
creating 1 resource(s)
Watching for changes to Job fleet-cleanup-clusterregistrations with timeout of 5m0s
Add/Modify event for fleet-cleanup-clusterregistrations: ADDED
fleet-cleanup-clusterregistrations: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
Add/Modify event for fleet-cleanup-clusterregistrations: MODIFIED
fleet-cleanup-clusterregistrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: UPGRADE FAILED: post-upgrade hooks failed: 1 error occurred:
        * timed out waiting for the condition

Reviewing the log shows the following job, which never completes:

Starting delete for "fleet-cleanup-clusterregistrations" Job

kubectl get job -n cattle-fleet-system
NAME                                 COMPLETIONS   DURATION   AGE
fleet-cleanup-clusterregistrations   0/1           2m57s      2m57s

This is the output of `kubectl describe pod`, showing that the container is being rejected because it would run as root:

kubectl describe pod fleet-cleanup-clusterregistrations-qhhhn -n cattle-fleet-system
Name:         fleet-cleanup-clusterregistrations-qhhhn
Namespace:    cattle-fleet-system
......
    State:          Waiting
      Reason:       CreateContainerConfigError
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  4m43s                   default-scheduler  Successfully assigned cattle-fleet-system/fleet-cleanup-clusterregistrations-qhhhn to xxxxxxxx
  Normal   Pulling    4m43s                   kubelet            Pulling image "xxxxxxxx/rancher/fleet-agent:v0.7.1"
  Normal   Pulled     4m42s                   kubelet            Successfully pulled image "xxxxxxxx/rancher/fleet-agent:v0.7.1" in 938.995085ms
  Warning  Failed     2m33s (x12 over 4m42s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "fleet-cleanup-clusterregistrations-qhhhn_cattle-fleet-system(12a6bd52-f021-4c12-80e9-ec7a92290e7f)", container: cleanup)
  Normal   Pulled     2m33s (x11 over 4m41s)  kubelet            Container image "secregistry.satcen.europa.eu/rancher/fleet-agent:v0.7.1" already present on machine

Reviewing the pod's security context shows that it does not define a user (runAsNonRoot is set, but no runAsUser):

spec:
  containers:
  - args:
    - cleanup
    command:
    - fleet
    image: xxxx/rancher/fleet-agent:v0.7.1
    name: cleanup

    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
  securityContext:
    fsGroup: 1
    supplementalGroups:
    - 1
  serviceAccount: fleet-controller
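
This combination explains the CreateContainerConfigError: with `runAsNonRoot: true` but no explicit `runAsUser`, the kubelet falls back to the image's `USER` directive, and for this fleet-agent image that resolves to root, so the container is refused. A minimal sketch of a security context that would pass the check (the UID 1000 is a hypothetical value for illustration, not necessarily what the chart uses):

```yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  runAsNonRoot: true
  # An explicit non-zero UID lets the kubelet admit the container without
  # consulting the image's USER directive. 1000 is a hypothetical example.
  runAsUser: 1000
```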

Reviewing the permissions of the service account used by this pod shows that the fleet-controller ClusterRole does not grant use of any PodSecurityPolicy that would allow the pod to be admitted:

kubectl get clusterrole fleet-controller -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    meta.helm.sh/release-name: fleet
    meta.helm.sh/release-namespace: cattle-fleet-system
  creationTimestamp: "2022-03-17T23:18:27Z"
  finalizers:
  - wrangler.cattle.io/auth-prov-v2-crole
  labels:
    app.kubernetes.io/managed-by: Helm
  name: fleet-controller
  resourceVersion: "252078754"
  uid: bd57429b-ca57-44e0-9fc4-be85a911af4a
rules:
- apiGroups:
  - gitjob.cattle.io
  resources:
  - '*'
  verbs:
  - '*'
- apiGroups:
  - fleet.cattle.io
  resources:
  - '*'
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - namespaces
  - serviceaccounts
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - secrets
  - configmaps
  verbs:
  - '*'
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  - roles
  - rolebindings
  verbs:
  - '*'

Expected Result
We expect the upgrade to complete without errors.


sbulage commented Sep 21, 2023

I successfully upgraded from Rancher 2.7.5 to 2.7.6 without any errors. Output of the relevant commands:

satya@opensuse15:~> kubectl get job -A
NAMESPACE       NAME                        COMPLETIONS   DURATION   AGE
fleet-default   test-fleet-examples-2ff03   1/1           21s        11m
kube-system     helm-install-traefik        1/1           27s        45m
kube-system     helm-install-traefik-crd    1/1           24s        45m
satya@opensuse15:~> kubectl get pods -n cattle-system 
NAME                               READY   STATUS      RESTARTS   AGE
helm-operation-czs7c               0/2     Completed   0          15m
helm-operation-f5k74               0/2     Completed   0          45m
helm-operation-grfgr               0/2     Completed   0          14m
helm-operation-hpcnv               0/2     Completed   0          45m
helm-operation-nlj28               0/2     Completed   0          44m
helm-operation-qz66m               0/2     Completed   0          45m
helm-operation-vxvzc               0/2     Completed   0          45m
rancher-569b86c8f5-7s8kn           1/1     Running     0          17m
rancher-webhook-788c48b988-rlg6d   1/1     Running     0          45m

Fleet also upgraded from 0.7.0 to 0.7.1 without any errors.

Please let me know if there is anything else I should check.


pdiaz-suse commented Sep 26, 2023

I have been able to reproduce the issue, but only when the cis-1.6 profile is enabled in the underlying RKE2 Kubernetes cluster.

Helm operation pods are continuously failing:

root@ip-172-31-26-45:/etc/rancher/rke2# kubectl get pod -n cattle-system
NAME                              READY   STATUS      RESTARTS      AGE
helm-operation-2klh9              1/2     Error       0             28m
helm-operation-2n6nn              1/2     Error       0             52m
helm-operation-2vqfh              1/2     Error       0             54m
helm-operation-4xskv              1/2     Error       0             47m
helm-operation-55qhj              1/2     Error       0             7m55s
helm-operation-5mf85              1/2     Error       0             3m57s
helm-operation-5rplt              2/2     Running     0             3m52s
helm-operation-78gpf              1/2     Error       0             23m
helm-operation-8jcww              1/2     Error       0             52m
helm-operation-8ztt4              1/2     Error       0             43m
helm-operation-97l5p              1/2     Error       0             52m
helm-operation-b5sfs              1/2     Error       0             22m
helm-operation-cndwz              0/2     Completed   0             52m
helm-operation-dww9d              1/2     Error       0             52m
helm-operation-gc7cs              1/2     Error       0             8m57s
helm-operation-gg29f              1/2     Error       0             33m
helm-operation-h7wsk              1/2     Error       0             18m
helm-operation-hhvqt              1/2     Error       0             18m
helm-operation-j26v7              1/2     Error       0             37m
helm-operation-jsk4k              1/2     Error       0             51m
helm-operation-nv7zz              0/2     Completed   0             53m
helm-operation-qrqsr              1/2     Error       0             13m
helm-operation-qt7bf              1/2     Error       0             52m
helm-operation-r82vk              1/2     Error       0             46m
helm-operation-r9r8w              1/2     Error       0             28m
helm-operation-s66gv              1/2     Error       0             13m
helm-operation-v687s              1/2     Error       0             51m
helm-operation-z7q6c              1/2     Error       0             43m
helm-operation-zd8t5              1/2     Error       0             38m
helm-operation-zr559              1/2     Error       0             51m
helm-operation-zs9w2              1/2     Error       0             33m
rancher-6bf9cd485c-8d5fg          1/1     Running     1 (55m ago)   56m
rancher-6bf9cd485c-qd79p          1/1     Running     0             56m
rancher-6bf9cd485c-rjphw          1/1     Running     0             56m
rancher-webhook-998454b77-ghsgd   1/1     Running     0             52m

This is due to the failure of the fleet-cleanup-clusterregistrations pod, which is stuck in CreateContainerConfigError:

root@ip-172-31-26-45:/etc/rancher/rke2# kubectl get pod -n cattle-fleet-system
NAME                                       READY   STATUS                       RESTARTS   AGE
fleet-cleanup-clusterregistrations-vsbhq   0/1     CreateContainerConfigError   0          8m26s
fleet-controller-64f5b4585-shjjb           1/1     Running                      0          52m
gitjob-58dc7cb797-wr28d                    1/1     Running                      0          52m
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  9m33s                   default-scheduler  Successfully assigned cattle-fleet-system/fleet-cleanup-clusterregistrations-vsbhq to ip-172-31-26-45
  Warning  Failed     7m39s (x12 over 9m33s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "fleet-cleanup-clusterregistrations-vsbhq_cattle-fleet-system(d32d3d68-6bbf-4783-9ecc-d6200284b411)", container: cleanup)
  Normal   Pulled     4m32s (x26 over 9m33s)  kubelet            Container image "rancher/fleet-agent:v0.8.0" already present on machine


HoustonDad commented Oct 9, 2023

Also hitting this issue!

Rancher 2.7.7 and RKE2 1.24.x w/CIS Profile 1.6 enabled


jcox10 commented Oct 9, 2023

This appears to be a missing PodSecurityPolicy (PSP) binding for hardened clusters using the CIS profile on Kubernetes <=1.24. On 1.25+, the entire cattle-fleet-system namespace is exempted from Pod Security Admission (PSA).
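
For reference, one way such an exemption can look on 1.25+ is namespace-level PSA labels (a sketch only; RKE2's CIS profile may instead ship an admission configuration file that exempts the namespace):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: cattle-fleet-system
  labels:
    # Pod Security Admission labels; "privileged" disables the
    # restricted checks for every pod in this namespace.
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```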

A quick workaround is to just bind the unrestricted PSP to the service account:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: unrestricted-psp
  namespace: cattle-fleet-system
rules:
- apiGroups:
  - policy # on Kubernetes 1.16+, PSPs are served from the policy API group, not extensions
  resourceNames:
  - system-unrestricted-psp
  resources:
  - podsecuritypolicies
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: unrestricted-psp
  namespace: cattle-fleet-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: unrestricted-psp
subjects:
- kind: ServiceAccount
  name: fleet-controller
  namespace: cattle-fleet-system
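
To apply the workaround, the manifests above can be saved to a file and applied, then the stuck job removed so a retried upgrade can recreate it (a sketch; the file name is an assumption):

```shell
# Save the Role and RoleBinding above as unrestricted-psp.yaml, then:
kubectl apply -f unrestricted-psp.yaml

# Remove the stuck cleanup job so the retried upgrade can recreate it:
kubectl delete job fleet-cleanup-clusterregistrations \
  -n cattle-fleet-system --ignore-not-found

# After re-running the upgrade, the job should reach 1/1 COMPLETIONS:
kubectl get job -n cattle-fleet-system
```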

Once the rancher charts are fixed, the Role and RoleBinding can be deleted.

@kkaempf kkaempf added the JIRA Must shout label Oct 11, 2023
@raulcabello raulcabello self-assigned this Oct 18, 2023

raulcabello commented Oct 19, 2023

QA Template

Solution

Added Security Context to the cleanup job in #1862

Testing

Install Rancher 2.7.5 on a hardened RKE2 1.24 cluster (see the issue for more info on the environment).

Upgrading to the latest Rancher should not produce any errors in the fleet-cleanup-clusterregistrations job.

Additional info

Needs a new fleet RC

@kkaempf kkaempf transferred this issue from rancher/rancher Oct 20, 2023
@kkaempf kkaempf added this to the 2024-Q1-2.8x milestone Oct 20, 2023
@kkaempf kkaempf changed the title [BUG] during rancher upgrade to 2.7.6 helm operation error with fleet-cleanup-clusterregistrations [SURE-7060] during rancher upgrade to 2.7.6 helm operation error with fleet-cleanup-clusterregistrations Dec 8, 2023
@kkaempf kkaempf modified the milestones: v2.8-Next1, v2.8.3 Jan 11, 2024

manno commented Feb 29, 2024

PR #1862 needs a backport to v0.9

@kkaempf kkaempf modified the milestones: v2.8.3, v2.8-Next1 Mar 1, 2024
@manno manno changed the title [SURE-7060] during rancher upgrade to 2.7.6 helm operation error with fleet-cleanup-clusterregistrations [v0.9][SURE-7060] during rancher upgrade to 2.7.6 helm operation error with fleet-cleanup-clusterregistrations Mar 1, 2024
@manno manno changed the title [v0.9][SURE-7060] during rancher upgrade to 2.7.6 helm operation error with fleet-cleanup-clusterregistrations [SURE-7060] during rancher upgrade to 2.7.6 helm operation error with fleet-cleanup-clusterregistrations Mar 1, 2024
@manno manno modified the milestones: v2.8-Next1, v2.9.0 Mar 1, 2024
@mmartin24
Collaborator

QA report

Testing considerations:

For hardening, I followed the steps detailed in this guide with a few adjustments.
Since PSP is removed after Kubernetes 1.24, testing was done with PSA and Kubernetes versions > 1.25.

Tested scenarios:
Scenario 1: Fresh installation of a hardened RKE2 cluster on Rancher, with a CIS scan reporting no errors.

Setup:

  • RKE2 version: rke2 v1.28.9+rke2r1
  • Hardening parameters used: cis and psa.yaml
  • Configured default Service account
  • Fresh installation of Rancher v2.9-6d87a11ea46b7571646d7c3d7af704584c39fd62-head
  • Confirmed successful Rancher deployment
  • Deployed CIS benchmark rke2-cis-1.8-profile-hardened with 71 passes, 0 errors and 48 warnings

(screenshot of CIS scan results, 2024-05-06)

Scenario 2: Installation of a hardened RKE2 cluster on Rancher 2.8 with a CIS scan reporting no errors, followed by an upgrade to 2.9 and a new CIS scan, also with no errors.

Setup:

  • RKE2 version: rke2 v1.26.15+rke2r1
  • Hardening parameters used: cis and psa.yaml
  • Configured default Service account
  • Fresh installation of Rancher v2.8-ec76f714a7d22be1d4266cf5385f0aef62a9a653-head
  • Confirmed successful Rancher deployment
  • Deployed CIS benchmark rke2-cis-1.8-profile-hardened with 71 passes, 0 errors and 48 warnings
  • Upgraded to Rancher v2.9-6d87a11ea46b7571646d7c3d7af704584c39fd62-head
  • Deployed new CIS benchmark rke2-cis-1.8-profile-hardened again with 71 passes, 0 errors and 48 warnings
  • Checked fleet-cleanup-clusterregistrations job did not throw any error

