
v1.5.6 -> v1.6.10 doesn't upgrade RGW #9134

Closed
ppintogbm opened this issue Nov 9, 2021 · 10 comments · Fixed by #9137

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
After following the procedure to upgrade the operator, all daemons move to the v1.6.10 rook-version label except RGW:

rook-ceph-crashcollector-0aa8cf2bdd56f6b59019dd40200e2eff       req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-crashcollector-1388f9d3418e5bb5b82a6ed36e96fd1e       req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-crashcollector-be9fc4de900c9ffd13cb624eb5101fc6       req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-crashcollector-f000d3b367fdd0b39a4117483d51dbcf       req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-mgr-a         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-mon-i         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-mon-t         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-mon-u         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-osd-0         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-osd-2         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-osd-3         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-rgw-my-store-a        req/upd/avl: 1/1/1      rook-version=v1.5.6
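
(For reference, the listing above matches the output of the version-check one-liner from the Rook upgrade guide, roughly:

kubectl -n rook-ceph get deployment -l rook_cluster=rook-ceph -o jsonpath='{range .items[*]}{.metadata.name}{"  \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"  \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'

The ceph-version listing below is the same command with ceph-version substituted for rook-version.)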

The same happens for a ceph-version upgrade (v15.2.8-0 -> v15.2.13-0):

rook-ceph-crashcollector-0aa8cf2bdd56f6b59019dd40200e2eff       req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-1388f9d3418e5bb5b82a6ed36e96fd1e       req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-be9fc4de900c9ffd13cb624eb5101fc6       req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-f000d3b367fdd0b39a4117483d51dbcf       req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mgr-a         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-i         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-t         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-u         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-0         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-2         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-3         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-rgw-my-store-a        req/upd/avl: 1/1/1      ceph-version=15.2.8-0

The cluster is also in HEALTH_WARN because of insecure global_id reclaim for clients and mons.
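
(Side note: once all clients and daemons are running patched versions, this warning is typically cleared from the toolbox with the following command; this is a general Ceph note, not something taken from this thread:

ceph config set mon auth_allow_insecure_global_id_reclaim false
)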

Expected behavior:
RGW upgraded and HEALTH_OK, so that a full upgrade to v1.7.x can be completed.

How to reproduce it (minimal and precise):
Follow the upgrade process from https://rook.io/docs/rook/v1.6/ceph-upgrade.html (including CSI upgrades).
This is a disconnected environment, so all container images are pulled from a local mirror.
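
For context, the core steps from that guide are roughly the following (a sketch of the documented procedure with image names adjusted for the local mirror, not the exact commands run here):

kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.6.10
kubectl -n rook-ceph patch CephCluster rook-ceph --type=merge -p '{"spec": {"cephVersion": {"image": "ceph/ceph:v15.2.13"}}}'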

File(s) to submit:

cluster.yaml

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: bnc000mon52.openshift.bncrcp.inst.bncr.fi.cr:5000/ceph/ceph:v15.2.13
  cleanupPolicy:
    sanitizeDisks: {}
  crashCollector:
    disable: false
  dashboard:
    enabled: true
    ssl: true
  dataDirHostPath: /var/lib/rook
  disruptionManagement:
    machineDisruptionBudgetNamespace: openshift-machine-api
    osdMaintenanceTimeout: 30
  external:
    enable: false
  healthCheck:
    daemonHealth:
      mon: {}
      osd: {}
      status: {}
  logCollector: {}
  mgr:
    modules:
    - enabled: true
      name: pg_autoscaler
  mon:
    count: 3
    volumeClaimTemplate:
      metadata: {}
      spec:
        resources:
          requests:
            storage: 10Gi
        storageClassName: thin
      status: {}
  monitoring: {}
  network:
    hostNetwork: false
    ipFamily: IPv4
    provider: ""
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: node-role.kubernetes.io/storage-node
              operator: Exists
      tolerations:
      - key: storage-node
        operator: Exists
    mgr: {}
    mon: {}
    osd: {}
  removeOSDsIfOutAndSafeToRemove: false
  resources:
    mds:
      limits:
        memory: 4Gi
      requests:
        memory: 512Mi
    mgr:
      limits:
        memory: 512Mi
      requests:
        memory: 512Mi
    mon:
      limits:
        memory: 2Gi
      requests:
        memory: 512Mi
    osd:
      limits:
        memory: 8Gi
      requests:
        memory: 512Mi
  security:
    kms: {}
  storage:
    storageClassDeviceSets:
    - count: 3
      name: set2
      placement:
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd-prepare
              topologyKey: kubernetes.io/hostname
            weight: 100
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd-prepare
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
      resources: {}
      tuneDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 750Gi
          storageClassName: thin
          volumeMode: Block

operator.log

Environment:

  • OS (e.g. from /etc/os-release):
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="46.82.202103211221-0"
VERSION_ID="4.6"
OPENSHIFT_VERSION="4.6"
RHEL_VERSION="8.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 46.82.202103211221-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.6"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.6"
OSTREE_VERSION='46.82.202103211221-0'
  • Kernel (e.g. uname -a):
    Linux bnc000mon48.openshift.bncrcp.inst.bncr.fi.cr 4.18.0-193.47.1.el8_2.x86_64 #1 SMP Thu Mar 4 03:03:32 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: vSphere
  • Kubernetes version (use kubectl version):
Client Version: 4.6.21
Server Version: 4.6.23
Kubernetes Version: v1.19.0+263ee0d
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Openshift
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
cluster:
    id:     2b25ae32-25c0-4a17-8d2c-adb1e22e63ef
    health: HEALTH_WARN
            client is using insecure global_id reclaim
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum i,t,u (age 99m)
    mgr: a(active, since 99m)
    osd: 3 osds: 3 up (since 94m), 3 in (since 7d)
    rgw: 1 daemon active (my.store.a)

  task status:

  data:
    pools:   9 pools, 209 pgs
    objects: 181.11k objects, 623 GiB
    usage:   1.8 TiB used, 380 GiB / 2.2 TiB avail
    pgs:     209 active+clean

  io:
    client:   770 KiB/s wr, 0 op/s rd, 25 op/s wr

ppintogbm added the bug label Nov 9, 2021
thotz (Contributor) commented Nov 10, 2021

I am guessing the upgrade didn't happen for RGW due to the HEALTH_WARN in ceph status, but I am wondering why the instance number in the CephObjectStore spec keeps changing.

ppintogbm (Author) commented

@thotz the HEALTH_WARN came up after moving to ceph v15.2.13, but the rook-version label on the my-store RGW never changed before that.

Also, sorry for the changes to the instance number... that was me trying to check whether there was an error in the CephObjectStore spec that was blocking the upgrade. Before the upgrade my CephCluster had 2 storageClassDeviceSets, one with count: 0, and when I patched the ceph version I got a new error saying that this value must be greater than 0. Suspecting there were other spec changes, I tried changing the instance count on the CephObjectStore to see whether any spec change would be picked up.

Maybe this helps as a timeline:
Rook upgrade -> HEALTH_OK, but without RGW upgrade
Ceph upgrade -> HEALTH_WARN

travisn (Member) commented Nov 10, 2021

In the operator log I see that you must have restarted the operator, and then updated the rgw settings to start 3 daemons, scale back to 0, and back to 1 rgw daemon. Did you see the number of rgw daemons change, or were the rgw pods never created or updated? I would certainly expect the rgw daemons to be updated. What does the image on the rgw deployment(s) show? The previous ceph image?
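
For reference, the image on the rgw deployment can be checked with something like the following (deployment name taken from the listing above):

kubectl -n rook-ceph get deployment rook-ceph-rgw-my-store-a -o jsonpath='{.spec.template.spec.containers[0].image}'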

ppintogbm (Author) commented

Yes @travisn, I restarted the operator. With the instance count changes there weren't any new pods, and even when scaled down to 0 the same pods kept running.

travisn (Member) commented Nov 10, 2021

Ok, I found a bug in the code where errors are being swallowed when creating/updating the rgw pod. You must be hitting some error, and then the rgw update is ignored. The only place that looks like it should cause this error is in this method, if the TLS secret with the certificate is not found. Does your CephObjectStore CR have TLS enabled, or does it perhaps still need to be configured?
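
For reference, the TLS setting on the store can be inspected with something like the following (assuming the store is named my-store and the field lives at spec.gateway.sslCertificateRef):

kubectl -n rook-ceph get cephobjectstore my-store -o jsonpath='{.spec.gateway.sslCertificateRef}'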

ppintogbm (Author) commented

I don't remember, and today I don't have access to the environment (it's an on-prem env), but I actually believe it isn't needed, because it's just for workload usage.

travisn (Member) commented Nov 10, 2021

I don't remember, and today I don't have access to the environment (it's an on-prem env), but I actually believe it isn't needed, because it's just for workload usage.

There is no need for TLS? Ok, anyway, we will get this fix into the next release and see if we can get a real error message out of the log after that.

ppintogbm (Author) commented

Fine, but there is another question @travisn...

For some reason, after the update to v1.6.10, I remember that the SSLCertificateRef field on the my-store CephObjectStore had changed to none instead of an empty string (""). Maybe I'm not reading the method right, but could the change to none be the reason the corresponding secret is not found?

travisn (Member) commented Nov 10, 2021

That looks likely. If the value is none instead of "", the operator will look for the TLS secret and fail.
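
As a possible workaround (not verified in this thread), resetting the field back to an empty string should stop the operator from looking up a secret named none, e.g.:

kubectl -n rook-ceph patch cephobjectstore my-store --type merge -p '{"spec":{"gateway":{"sslCertificateRef":""}}}'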

ppintogbm (Author) commented

Yes... but I didn't set that value... it was changed automatically after the operator upgrade.
