
ceph mgr randomly has dashboard ssl enabled when ssl: false in helm cephClusterSpec #11064

Closed
jhoblitt opened this issue Sep 27, 2022 · 5 comments

Comments

@jhoblitt
Contributor

Deviation from expected behavior:

I have created test Ceph clusters multiple times per day, most business days, for the last couple of weeks. All CephClusters have been created with the rook-ceph-cluster chart 1.10.1 using the same chart values, which contain:

cephClusterSpec:
  dashboard:
    enabled: true
    ssl: false

The mgr pods and svc are reliably configured to expose port 7000, e.g.:

$ k -n rook-ceph get pod -l app=rook-ceph-mgr -ojson | jq '.items[].spec.containers[].ports'
[
  {
    "containerPort": 6800,
    "name": "mgr",
    "protocol": "TCP"
  },
  {
    "containerPort": 9283,
    "name": "http-metrics",
    "protocol": "TCP"
  },
  {
    "containerPort": 7000,
    "name": "dashboard",
    "protocol": "TCP"
  }
]
null
null
[
  {
    "containerPort": 6800,
    "name": "mgr",
    "protocol": "TCP"
  },
  {
    "containerPort": 9283,
    "name": "http-metrics",
    "protocol": "TCP"
  },
  {
    "containerPort": 7000,
    "name": "dashboard",
    "protocol": "TCP"
  }
]
null
null

However, twice now I have observed that the mgr pod was started without ssl being disabled. From the mgr pod logs:

debug 2022-09-27T17:27:13.596+0000 7fe5b3f91700  0 [dashboard INFO root] server: ssl=yes host=0.0.0.0 port=8443
debug 2022-09-27T17:27:13.600+0000 7fe5b3f91700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
debug 2022-09-27T17:27:13.605+0000 7f3060a80700  0 [dashboard INFO root] server: ssl=yes host=0.0.0.0 port=8443
debug 2022-09-27T17:27:13.605+0000 7f3060a80700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured

I would guesstimate it is happening on roughly 1 out of 10 CephCluster creations.

And, extremely strangely, the resulting CephCluster resource doesn't list the ssl: false key, e.g.:

    dashboard:
      enabled: true
    dataDirHostPath: /var/lib/rook

I have confirmed that the CRD installed on the k8s cluster does have the ssl key defined, e.g.:

              dashboard:
                description: Dashboard settings
                nullable: true
                properties:
                  enabled:
                    description: Enabled determines whether to enable the dashboard
                    type: boolean
                  port:
                    description: Port is the dashboard webserver port
                    maximum: 65535
                    minimum: 0
                    type: integer
                  ssl:
                    description: SSL determines whether SSL should be used
                    type: boolean
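
For completeness, a check along these lines can confirm the installed schema (the command below is only an illustration, not taken from the report; cephclusters.ceph.rook.io is Rook's standard CRD name):

# inspect the dashboard ssl property in the installed CephCluster CRD
kubectl get crd cephclusters.ceph.rook.io -o yaml | grep -B 2 -A 3 'ssl:'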

Clearly, something is going wrong: either Helm is eating the key, or the operator is editing the CephCluster to remove it. Both possibilities seem rather extraordinary.
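
One way to narrow down which of the two it is (a sketch only; the release name, values file name, and the default CephCluster name rook-ceph are assumptions, not taken from the report) is to compare Helm's rendered output with what the API server actually stores:

# does the ssl key survive chart rendering?
helm template rook-ceph-cluster rook-release/rook-ceph-cluster -n rook-ceph -f values.yaml | grep -A 3 'dashboard:'

# what does the stored CephCluster actually contain?
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.dashboard}'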

Expected behavior:

The mgr is consistently set up with ssl disabled.

How to reproduce it (minimal and precise):
Create and delete the CephCluster repeatedly (probably > 10 times) and eventually it will happen; a rough loop is sketched below.
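
Something like the following could serve as a reproduction loop (a sketch only; the release/values names and the mgr container name are assumptions, and a real teardown of a Ceph cluster also involves the cleanupPolicy confirmation and wiping disks, which is omitted here):

for i in $(seq 1 20); do
  helm install rook-ceph-cluster rook-release/rook-ceph-cluster -n rook-ceph -f values.yaml
  # wait for the mgr pods, then check which mode the dashboard started in
  kubectl -n rook-ceph wait --for=condition=Ready pod -l app=rook-ceph-mgr --timeout=10m
  kubectl -n rook-ceph logs -l app=rook-ceph-mgr -c mgr --tail=-1 | grep 'server: ssl='
  helm uninstall rook-ceph-cluster -n rook-ceph
done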

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
---
operatorNamespace: rook-ceph

toolbox:
  enabled: true
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 5
    - key: role
      operator: Equal
      value: storage-node
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: role
                operator: In
                values:
                  - storage-node

monitoring:
  enabled: true
  rulesNamespaceOverride: rook-ceph

cephClusterSpec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.3
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: false
  crashCollector:
    disable: false
  logCollector:
    enabled: true
    periodicity: 1d  # SUFFIX may be 'h' for hours or 'd' for days.
  cleanupPolicy:
    #confirmation: "yes-really-destroy-data"
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: role
                  operator: In
                  values:
                    - storage-node
      tolerations:
        - key: role
          operator: Equal
          value: storage-node
          effect: NoSchedule
  removeOSDsIfOutAndSafeToRemove: false
  #  priorityClassNames:
  #    all: rook-ceph-default-priority-class
  #    mon: rook-ceph-mon-priority-class
  #    osd: rook-ceph-osd-priority-class
  #    mgr: rook-ceph-mgr-priority-class
  storage:
    useAllNodes: false
    useAllDevices: false
    config:
      osdsPerDevice: "4"
    nodes:
      - name: kueyen02
        devices:
          - name: sdb
      - name: kueyen03
        devices:
          - name: sdb
      - name: kueyen04
        devices:
          - name: sdb
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 30
    manageMachineDisruptionBudgets: false
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
  resources:
    mgr:
      limits:
        cpu: "1000m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "512Mi"
    mon:
      limits:
        cpu: "2000m"
        memory: "2Gi"
      requests:
        cpu: "1000m"
        memory: "1Gi"
    osd:
      limits:
        cpu: "2000m"
        memory: "8Gi"
      requests:
        cpu: "1000m"
        memory: "8Gi"
    prepareosd:
      limits:
        cpu: "500m"
        memory: "400Mi"
      requests:
        cpu: "500m"
        memory: "50Mi"
    mgr-sidecar:
      limits:
        cpu: "500m"
        memory: "100Mi"
      requests:
        cpu: "100m"
        memory: "40Mi"
    crashcollector:
      limits:
        cpu: "500m"
        memory: "60Mi"
      requests:
        cpu: "100m"
        memory: "60Mi"
    logcollector:
      limits:
        cpu: "500m"
        memory: "1Gi"
      requests:
        cpu: "100m"
        memory: "100Mi"
    cleanup:
      limits:
        cpu: "500m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "100Mi"

ingress:
  dashboard:
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-staging
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/backend-protocol: HTTP
      nginx.ingress.kubernetes.io/server-snippet: |
        proxy_ssl_verify off;
    host:
      name: &hostname ceph.kueyen.ls.example.com
    tls:
      - hosts:
          - *hostname
        secretName: rook-ceph-mgr-dashboard-ingress-tls

cephBlockPools:
cephFileSystems:
cephFileSystemVolumeSnapshotClass:
cephBlockPoolsVolumeSnapshotClass:
cephObjectStores:
jhoblitt added the bug label Sep 27, 2022
@parth-gr
Member

Clearly, something is going wrong: either Helm is eating the key, or the operator is editing the CephCluster to remove it. Both possibilities seem rather extraordinary.

I have also seen this scenario when installing with the cluster and operator manifests.

@edwardchenchen

I can confirm I have the same issue, using:

  chart: rook-ceph
  repoURL: https://charts.rook.io/release
  targetRevision: v1.10.4

The log in the manager pod:

debug 2022-10-22T11:16:10.794+0000 7f8b6ace9700  0 [dashboard INFO root] server: ssl=yes host=:: port=8443

This then causes the rook dashboard service to be exposed on the wrong port. [screenshots omitted]

@travisn
Member

travisn commented Oct 24, 2022

There seems to be a race condition in how the operator configures the dashboard ssl setting. In a test cluster, I see the operator set ssl here:

2022-10-24 19:28:42.928250 I | op-config: setting "mgr.a"="mgr/dashboard/ssl"="false" option to the mon configuration database

In the mgr log, the following is observed. Notice that the dashboard first starts with the default ssl=yes, two seconds before the timestamp at which the operator sets it to false.

debug 2022-10-24T19:28:40.818+0000 7f1cda25c700  0 [dashboard INFO root] server: ssl=yes host=0.0.0.0 port=8443
debug 2022-10-24T19:28:40.818+0000 7f1cda25c700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
...
debug 2022-10-24T19:29:00.134+0000 7f98bf1fc700  0 [dashboard INFO root] server: ssl=no host=0.0.0.0 port=7000
debug 2022-10-24T19:29:00.135+0000 7f98bf1fc700  0 [dashboard INFO root] Configured CherryPy, starting engine...

When the dashboard is configured, the module is first enabled, then the operator sets the properties as seen in configureDashboardModuleSettings(). If any setting differs from the defaults, the dashboard module is then restarted at the end. In my test cluster I do see the dashboard was restarted a few seconds after the initial start, which caused the dashboard to be reloaded as expected.

2022-10-24 19:28:44.172255 I | op-mgr: dashboard config has changed. restarting the dashboard module
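
For reference, the settings involved can be inspected and applied manually from the toolbox pod with the standard Ceph dashboard commands; this is only an illustration of the sequence described above, not necessarily the exact code path the operator takes:

# what does the mon configuration database currently hold?
ceph config get mgr mgr/dashboard/ssl

# disable ssl and move the dashboard to the non-ssl port
ceph config set mgr mgr/dashboard/ssl false
ceph config set mgr mgr/dashboard/server_port 7000

# restart the dashboard module so it picks up the new settings
ceph mgr module disable dashboard
ceph mgr module enable dashboard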

@jhoblitt @edwardchenchen A couple questions:

  1. Do you see this message in the operator about the config changing?
  2. Does restarting the mgr pod work around the issue for you? (An example restart is sketched below.)
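
For question 2, the restart could be done by deleting the mgr pods and letting their deployments recreate them (the label selector app=rook-ceph-mgr is the one used earlier in this report; the exact command is illustrative, not an official procedure):

kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr

After the pods come back, the mgr logs should show whether the dashboard started with ssl=no on port 7000.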

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 30, 2022