
v1.5.6 -> v1.6.10 doesn't upgrade RGW #9134

Closed
ppintogbm opened this issue Nov 9, 2021 · 10 comments · Fixed by #9137

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
After following the procedure to upgrade the operator, all daemons move to the v1.6.10 rook-version label except RGW:

rook-ceph-crashcollector-0aa8cf2bdd56f6b59019dd40200e2eff       req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-crashcollector-1388f9d3418e5bb5b82a6ed36e96fd1e       req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-crashcollector-be9fc4de900c9ffd13cb624eb5101fc6       req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-crashcollector-f000d3b367fdd0b39a4117483d51dbcf       req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-mgr-a         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-mon-i         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-mon-t         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-mon-u         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-osd-0         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-osd-2         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-osd-3         req/upd/avl: 1/1/1      rook-version=v1.6.10
rook-ceph-rgw-my-store-a        req/upd/avl: 1/1/1      rook-version=v1.5.6
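
(For reference, the listing above matches the output of the version-check one-liner from the Rook upgrade guide, roughly:

kubectl -n rook-ceph get deployment -l rook_cluster=rook-ceph -o jsonpath='{range .items[*]}{.metadata.name}{"  \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"  \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'

The ceph-version listing below is the same command with ceph-version substituted for rook-version.)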

The same happens for a ceph-version upgrade (v15.2.8-0 -> v15.2.13-0):

rook-ceph-crashcollector-0aa8cf2bdd56f6b59019dd40200e2eff       req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-1388f9d3418e5bb5b82a6ed36e96fd1e       req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-be9fc4de900c9ffd13cb624eb5101fc6       req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-f000d3b367fdd0b39a4117483d51dbcf       req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mgr-a         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-i         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-t         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-u         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-0         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-2         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-3         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-rgw-my-store-a        req/upd/avl: 1/1/1      ceph-version=15.2.8-0

The cluster is also in HEALTH_WARN because of insecure global_id reclaim for clients and mons.
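
(Side note: once all clients and daemons are running patched versions, this warning is typically cleared from the toolbox with the following command; this is a general Ceph note, not something taken from this thread:

ceph config set mon auth_allow_insecure_global_id_reclaim false
)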

Expected behavior:
RGW upgraded and HEALTH_OK, so that a full upgrade to v1.7.x can be completed.

How to reproduce it (minimal and precise):
Follow the upgrade process from https://rook.io/docs/rook/v1.6/ceph-upgrade.html (including CSI upgrades).
This is a disconnected environment, so all container images are pulled from a local mirror.
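
For context, the core steps from that guide are roughly the following (a sketch of the documented procedure with image names adjusted for the local mirror, not the exact commands run here):

kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.6.10
kubectl -n rook-ceph patch CephCluster rook-ceph --type=merge -p '{"spec": {"cephVersion": {"image": "ceph/ceph:v15.2.13"}}}'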

File(s) to submit:

cluster.yaml

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: bnc000mon52.openshift.bncrcp.inst.bncr.fi.cr:5000/ceph/ceph:v15.2.13
  cleanupPolicy:
    sanitizeDisks: {}
  crashCollector:
    disable: false
  dashboard:
    enabled: true
    ssl: true
  dataDirHostPath: /var/lib/rook
  disruptionManagement:
    machineDisruptionBudgetNamespace: openshift-machine-api
    osdMaintenanceTimeout: 30
  external:
    enable: false
  healthCheck:
    daemonHealth:
      mon: {}
      osd: {}
      status: {}
  logCollector: {}
  mgr:
    modules:
    - enabled: true
      name: pg_autoscaler
  mon:
    count: 3
    volumeClaimTemplate:
      metadata: {}
      spec:
        resources:
          requests:
            storage: 10Gi
        storageClassName: thin
      status: {}
  monitoring: {}
  network:
    hostNetwork: false
    ipFamily: IPv4
    provider: ""
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: node-role.kubernetes.io/storage-node
              operator: Exists
      tolerations:
      - key: storage-node
        operator: Exists
    mgr: {}
    mon: {}
    osd: {}
  removeOSDsIfOutAndSafeToRemove: false
  resources:
    mds:
      limits:
        memory: 4Gi
      requests:
        memory: 512Mi
    mgr:
      limits:
        memory: 512Mi
      requests:
        memory: 512Mi
    mon:
      limits:
        memory: 2Gi
      requests:
        memory: 512Mi
    osd:
      limits:
        memory: 8Gi
      requests:
        memory: 512Mi
  security:
    kms: {}
  storage:
    storageClassDeviceSets:
    - count: 3
      name: set2
      placement:
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd-prepare
              topologyKey: kubernetes.io/hostname
            weight: 100
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd-prepare
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
      resources: {}
      tuneDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 750Gi
          storageClassName: thin
          volumeMode: Block

operator.log

Environment:

  • OS (e.g. from /etc/os-release):
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="46.82.202103211221-0"
VERSION_ID="4.6"
OPENSHIFT_VERSION="4.6"
RHEL_VERSION="8.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 46.82.202103211221-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.6"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.6"
OSTREE_VERSION='46.82.202103211221-0'
  • Kernel (e.g. uname -a):
    Linux bnc000mon48.openshift.bncrcp.inst.bncr.fi.cr 4.18.0-193.47.1.el8_2.x86_64 #1 SMP Thu Mar 4 03:03:32 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: vSphere
  • Kubernetes version (use kubectl version):
Client Version: 4.6.21
Server Version: 4.6.23
Kubernetes Version: v1.19.0+263ee0d
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Openshift
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
cluster:
    id:     2b25ae32-25c0-4a17-8d2c-adb1e22e63ef
    health: HEALTH_WARN
            client is using insecure global_id reclaim
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum i,t,u (age 99m)
    mgr: a(active, since 99m)
    osd: 3 osds: 3 up (since 94m), 3 in (since 7d)
    rgw: 1 daemon active (my.store.a)

  task status:

  data:
    pools:   9 pools, 209 pgs
    objects: 181.11k objects, 623 GiB
    usage:   1.8 TiB used, 380 GiB / 2.2 TiB avail
    pgs:     209 active+clean

  io:
    client:   770 KiB/s wr, 0 op/s rd, 25 op/s wr

ppintogbm added the bug label Nov 9, 2021
thotz (Contributor) commented Nov 10, 2021

I am guessing the upgrade didn't happen for RGW due to the HEALTH_WARN in ceph status, but I am wondering why the instance number in the CephObjectStore spec keeps changing.

ppintogbm (Author) commented

@thotz the HEALTH_WARN came up after moving to ceph v15.2.13, but the rook-version label on the my-store RGW never changed before that.

Also, sorry for the changes to the instance number... that was me trying to check whether there was an error in the CephObjectStore spec that was blocking the upgrade. Before the upgrade my CephCluster had 2 storageClassDeviceSets, one with count: 0, and when I patched the ceph version I got a new error saying that this value must be greater than 0. Suspecting there were other spec changes, I tried changing the instance count on the CephObjectStore to see whether any spec change would be picked up.

Maybe this helps as a timeline:
Rook upgrade -> HEALTH_OK, but without RGW upgrade
Ceph upgrade -> HEALTH_WARN

travisn (Member) commented Nov 10, 2021

In the operator log I see that you must have restarted the operator, and then updated the rgw settings to start 3 daemons, scale back to 0, and back to 1 rgw daemon. Did you see the number of rgw daemons change, or were the rgw pods never created or updated? I would certainly expect the rgw daemons to be updated. What does the image on the rgw deployment(s) show? The previous ceph image?
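
For reference, the image on the rgw deployment can be checked with something like the following (deployment name taken from the listing above):

kubectl -n rook-ceph get deployment rook-ceph-rgw-my-store-a -o jsonpath='{.spec.template.spec.containers[0].image}'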

ppintogbm (Author) commented

Yes @travisn, I restarted the operator. With the instance count changes there weren't any new pods, and even when scaled down to 0 the same pods kept running.

travisn (Member) commented Nov 10, 2021

Ok, I found a bug in the code where errors are being swallowed when creating/updating the rgw pod. You must be hitting some error, and then the rgw update is ignored. The only place that looks like it should cause this error is in this method, if the TLS secret with the certificate is not found. Does your CephObjectStore CR have TLS enabled, or does it perhaps still need to be configured?
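
For reference, the TLS setting on the store can be inspected with something like the following (assuming the store is named my-store and the field lives at spec.gateway.sslCertificateRef):

kubectl -n rook-ceph get cephobjectstore my-store -o jsonpath='{.spec.gateway.sslCertificateRef}'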

ppintogbm (Author) commented

I don't remember, and today I don't have access to the environment (it's an on-prem env), but I actually believe it isn't needed, because it's just for workload usage.

travisn (Member) commented Nov 10, 2021

I don't remember, and today I don't have access to the environment (it's an on-prem env), but I actually believe it isn't needed, because it's just for workload usage.

There is no need for TLS? Ok, anyway, we will get this fix into the next release and see if we can get a real error message out of the log after that.

ppintogbm (Author) commented

Fine, but there is another question @travisn...

For some reason, after the update to v1.6.10, I remember that the SSLCertificateRef field on the my-store CephObjectStore had changed to none instead of an empty string (""). Maybe I'm not reading the method right, but could the change to none be the reason the corresponding secret is not found?

travisn (Member) commented Nov 10, 2021

That looks likely. If the value is none instead of "", the operator will look for the TLS secret and fail.
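
As a possible workaround (not verified in this thread), resetting the field back to an empty string should stop the operator from looking up a secret named none, e.g.:

kubectl -n rook-ceph patch cephobjectstore my-store --type merge -p '{"spec":{"gateway":{"sslCertificateRef":""}}}'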

ppintogbm (Author) commented

Yes... but I didn't set that value... it was changed automatically after the operator upgrade.
