
ceph mgr randomly has dashboard ssl enabled when ssl: false in helm cephClusterSpec #11064

Closed
jhoblitt opened this issue Sep 27, 2022 · 5 comments

Comments

@jhoblitt
Contributor

Deviation from expected behavior:

I have created test Ceph clusters multiple times per day, most business days, for the last couple of weeks. All CephClusters have been created with the rook-ceph-cluster chart 1.10.1 using the same chart values, which contain:

cephClusterSpec:
  dashboard:
    enabled: true
    ssl: false

The mgr pods and svc are reliably configured to expose port 7000, e.g.:

$ k -n rook-ceph get pod -l app=rook-ceph-mgr -ojson | jq '.items[].spec.containers[].ports'
[
  {
    "containerPort": 6800,
    "name": "mgr",
    "protocol": "TCP"
  },
  {
    "containerPort": 9283,
    "name": "http-metrics",
    "protocol": "TCP"
  },
  {
    "containerPort": 7000,
    "name": "dashboard",
    "protocol": "TCP"
  }
]
null
null
[
  {
    "containerPort": 6800,
    "name": "mgr",
    "protocol": "TCP"
  },
  {
    "containerPort": 9283,
    "name": "http-metrics",
    "protocol": "TCP"
  },
  {
    "containerPort": 7000,
    "name": "dashboard",
    "protocol": "TCP"
  }
]
null
null

However, twice now I have observed that the mgr pod was started without ssl being disabled. From the mgr pod logs:

debug 2022-09-27T17:27:13.596+0000 7fe5b3f91700  0 [dashboard INFO root] server: ssl=yes host=0.0.0.0 port=8443
debug 2022-09-27T17:27:13.600+0000 7fe5b3f91700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
debug 2022-09-27T17:27:13.605+0000 7f3060a80700  0 [dashboard INFO root] server: ssl=yes host=0.0.0.0 port=8443
debug 2022-09-27T17:27:13.605+0000 7f3060a80700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured

I would guesstimate it is happening on roughly 1 out of 10 CephCluster creations.

And, extremely strangely, the resulting CephCluster resource doesn't list the ssl: false key, e.g.:

    dashboard:
      enabled: true
    dataDirHostPath: /var/lib/rook

I have confirmed that the CRD installed on the k8s cluster does have the ssl key defined, e.g.:

              dashboard:
                description: Dashboard settings
                nullable: true
                properties:
                  enabled:
                    description: Enabled determines whether to enable the dashboard
                    type: boolean
                  port:
                    description: Port is the dashboard webserver port
                    maximum: 65535
                    minimum: 0
                    type: integer
                  ssl:
                    description: SSL determines whether SSL should be used
                    type: boolean
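
For completeness, a check along these lines can confirm the installed schema (the command below is only an illustration, not taken from the report; cephclusters.ceph.rook.io is Rook's standard CRD name):

# inspect the dashboard ssl property in the installed CephCluster CRD
kubectl get crd cephclusters.ceph.rook.io -o yaml | grep -B 2 -A 3 'ssl:'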

Clearly, something is going wrong: either Helm is eating the key, or the operator is editing the CephCluster to remove it. Both possibilities seem rather extraordinary.
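
One way to narrow down which of the two it is (a sketch only; the release name, values file name, and the default CephCluster name rook-ceph are assumptions, not taken from the report) is to compare Helm's rendered output with what the API server actually stores:

# does the ssl key survive chart rendering?
helm template rook-ceph-cluster rook-release/rook-ceph-cluster -n rook-ceph -f values.yaml | grep -A 3 'dashboard:'

# what does the stored CephCluster actually contain?
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.dashboard}'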

Expected behavior:

The mgr is consistently set up with ssl disabled.

How to reproduce it (minimal and precise):
Create and delete the CephCluster repeatedly (probably > 10 times) and eventually it will happen; a rough loop is sketched below.
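
Something like the following could serve as a reproduction loop (a sketch only; the release/values names and the mgr container name are assumptions, and a real teardown of a Ceph cluster also involves the cleanupPolicy confirmation and wiping disks, which is omitted here):

for i in $(seq 1 20); do
  helm install rook-ceph-cluster rook-release/rook-ceph-cluster -n rook-ceph -f values.yaml
  # wait for the mgr pods, then check which mode the dashboard started in
  kubectl -n rook-ceph wait --for=condition=Ready pod -l app=rook-ceph-mgr --timeout=10m
  kubectl -n rook-ceph logs -l app=rook-ceph-mgr -c mgr --tail=-1 | grep 'server: ssl='
  helm uninstall rook-ceph-cluster -n rook-ceph
done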

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
---
operatorNamespace: rook-ceph

toolbox:
  enabled: true
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 5
    - key: role
      operator: Equal
      value: storage-node
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: role
                operator: In
                values:
                  - storage-node

monitoring:
  enabled: true
  rulesNamespaceOverride: rook-ceph

cephClusterSpec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.3
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: false
  crashCollector:
    disable: false
  logCollector:
    enabled: true
    periodicity: 1d  # SUFFIX may be 'h' for hours or 'd' for days.
  cleanupPolicy:
    #confirmation: "yes-really-destroy-data"
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: role
                  operator: In
                  values:
                    - storage-node
      tolerations:
        - key: role
          operator: Equal
          value: storage-node
          effect: NoSchedule
  removeOSDsIfOutAndSafeToRemove: false
  #  priorityClassNames:
  #    all: rook-ceph-default-priority-class
  #    mon: rook-ceph-mon-priority-class
  #    osd: rook-ceph-osd-priority-class
  #    mgr: rook-ceph-mgr-priority-class
  storage:
    useAllNodes: false
    useAllDevices: false
    config:
      osdsPerDevice: "4"
    nodes:
      - name: kueyen02
        devices:
          - name: sdb
      - name: kueyen03
        devices:
          - name: sdb
      - name: kueyen04
        devices:
          - name: sdb
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 30
    manageMachineDisruptionBudgets: false
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
  resources:
    mgr:
      limits:
        cpu: "1000m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "512Mi"
    mon:
      limits:
        cpu: "2000m"
        memory: "2Gi"
      requests:
        cpu: "1000m"
        memory: "1Gi"
    osd:
      limits:
        cpu: "2000m"
        memory: "8Gi"
      requests:
        cpu: "1000m"
        memory: "8Gi"
    prepareosd:
      limits:
        cpu: "500m"
        memory: "400Mi"
      requests:
        cpu: "500m"
        memory: "50Mi"
    mgr-sidecar:
      limits:
        cpu: "500m"
        memory: "100Mi"
      requests:
        cpu: "100m"
        memory: "40Mi"
    crashcollector:
      limits:
        cpu: "500m"
        memory: "60Mi"
      requests:
        cpu: "100m"
        memory: "60Mi"
    logcollector:
      limits:
        cpu: "500m"
        memory: "1Gi"
      requests:
        cpu: "100m"
        memory: "100Mi"
    cleanup:
      limits:
        cpu: "500m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "100Mi"

ingress:
  dashboard:
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-staging
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/backend-protocol: HTTP
      nginx.ingress.kubernetes.io/server-snippet: |
        proxy_ssl_verify off;
    host:
      name: &hostname ceph.kueyen.ls.example.com
    tls:
      - hosts:
          - *hostname
        secretName: rook-ceph-mgr-dashboard-ingress-tls

cephBlockPools:
cephFileSystems:
cephFileSystemVolumeSnapshotClass:
cephBlockPoolsVolumeSnapshotClass:
cephObjectStores:
jhoblitt added the bug label Sep 27, 2022
@parth-gr
Member

Clearly, something is going wrong: either Helm is eating the key, or the operator is editing the CephCluster to remove it. Both possibilities seem rather extraordinary.

I have also seen this scenario when installing with the cluster and operator manifests.

@edwardchenchen

I can confirm I have the same issue, using:

  chart: rook-ceph
  repoURL: https://charts.rook.io/release
  targetRevision: v1.10.4

The log in the manager pod:

debug 2022-10-22T11:16:10.794+0000 7f8b6ace9700  0 [dashboard INFO root] server: ssl=yes host=:: port=8443

This then causes the rook dashboard service to be exposed on the wrong port. [screenshots omitted]

@travisn
Member

travisn commented Oct 24, 2022

There seems to be a race condition in how the operator configures the dashboard ssl setting. In a test cluster, I see the operator set ssl here:

2022-10-24 19:28:42.928250 I | op-config: setting "mgr.a"="mgr/dashboard/ssl"="false" option to the mon configuration database

In the mgr log, the following is observed. Notice that the dashboard first starts with the default ssl=yes, two seconds before the timestamp at which the operator sets it to false.

debug 2022-10-24T19:28:40.818+0000 7f1cda25c700  0 [dashboard INFO root] server: ssl=yes host=0.0.0.0 port=8443
debug 2022-10-24T19:28:40.818+0000 7f1cda25c700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
...
debug 2022-10-24T19:29:00.134+0000 7f98bf1fc700  0 [dashboard INFO root] server: ssl=no host=0.0.0.0 port=7000
debug 2022-10-24T19:29:00.135+0000 7f98bf1fc700  0 [dashboard INFO root] Configured CherryPy, starting engine...

When the dashboard is configured, the module is first enabled, then the operator sets the properties as seen in configureDashboardModuleSettings(). If any setting differs from the defaults, the dashboard module is then restarted at the end. In my test cluster I do see the dashboard was restarted a few seconds after the initial start, which caused the dashboard to be reloaded as expected.

2022-10-24 19:28:44.172255 I | op-mgr: dashboard config has changed. restarting the dashboard module
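
For reference, the settings involved can be inspected and applied manually from the toolbox pod with the standard Ceph dashboard commands; this is only an illustration of the sequence described above, not necessarily the exact code path the operator takes:

# what does the mon configuration database currently hold?
ceph config get mgr mgr/dashboard/ssl

# disable ssl and move the dashboard to the non-ssl port
ceph config set mgr mgr/dashboard/ssl false
ceph config set mgr mgr/dashboard/server_port 7000

# restart the dashboard module so it picks up the new settings
ceph mgr module disable dashboard
ceph mgr module enable dashboard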

@jhoblitt @edwardchenchen A couple questions:

  1. Do you see this message in the operator about the config changing?
  2. Does restarting the mgr pod work around the issue for you? (An example restart is sketched below.)
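
For question 2, the restart could be done by deleting the mgr pods and letting their deployments recreate them (the label selector app=rook-ceph-mgr is the one used earlier in this report; the exact command is illustrative, not an official procedure):

kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr

After the pods come back, the mgr logs should show whether the dashboard started with ssl=no on port 7000.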

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 30, 2022