rook operator fails to properly manage one osd-prepare-job #8558

Closed
raz3k opened this issue Aug 18, 2021 · 4 comments · Fixed by #9116

raz3k commented Aug 18, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
The operator keeps looping with the message:
op-osd: waiting... 2 of 3 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated

Expected behavior:
Should finish reconciliation.

How to reproduce it (minimal and precise):

It occurred when upgrading a v1.6.7-based chart to the same v1.6.7-based chart, which restarted the operator.
We tried restarting the operator and deleting the configmap rook-ceph-osd-nodename-3-status; what finally fixed it was manually deleting the prepare job for nodename-3 and restarting the operator, after which it reported "reconciliation complete".
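
In kubectl terms, the steps were roughly the following (a sketch; the configmap and job names and the kube-system namespace come from the outputs below, while the operator deployment name is inferred from the operator pod name and may differ in other setups):

# earlier attempt that did not help on its own: remove the per-node OSD status configmap
$ kubectl -n kube-system delete configmap rook-ceph-osd-nodename-3-status
# what finally fixed it: delete the stuck OSD prepare job, then restart the operator
$ kubectl -n kube-system delete job rook-ceph-osd-prepare-nodename-3
$ kubectl -n kube-system rollout restart deployment rook-ceph-operator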

File(s) to submit:

$ kubectl get cm rook-ceph-osd-nodename-3-status -n kube-system -o yaml --> https://pastebin.com/LbxsVTFJ
$ kubectl get cephclusters.ceph.rook.io -A -o yaml --> https://pastebin.com/evNz5EFT
$ kubectl get pods -n kube-system --> https://pastebin.com/An64qYhU
$ kubectl logs rook-ceph-operator-5d684ccc5c-jlngj -n kube-system --> https://pastebin.com/YHwzTspv
$ kubectl logs rook-ceph-osd-prepare-nodename-3-4tq6w -n kube-system --> https://pastebin.com/fnAfJWBa
$ kubectl get job -n kube-system rook-ceph-osd-prepare-nodename-3 -o yaml --> https://pastebin.com/4jsx6vwy

Environment:

  • OS (e.g. from /etc/os-release): CentOS 7.10
  • Kernel (e.g. uname -a): 3.10.0-1160.36.2.el7.x86_64
  • Cloud provider or hardware configuration: OpenStack
  • Rook version (use rook version inside of a Rook Pod): v1.6.7
  • Storage backend version (e.g. for ceph do ceph -v): 16.2.2 (e8f22dde28889481f4dda2beb8a07788204821d3)
  • Kubernetes version (use kubectl version): GitVersion:"v1.21.1"
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Vanilla kubeadm deployed 3 master+workers
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_WARN noout flag(s) set
raz3k added the bug label Aug 18, 2021

travisn commented Aug 18, 2021

Deleting the status configmap and osd prepare job was the right thing to do when the operator gets stuck. It has been a rare issue to track down, so do let us know if you see it again.

@prazumovsky

Found again on rook v1.6.8 and ceph 15.2.13.
rco-waiting-prepare-jobs.log
rook-ceph.log


travisn commented Oct 14, 2021

Thanks for the logs. Do you still have the OSD status configmaps in that state, or did you already work around it and delete them? I can probably just work off the logs you provided if you don't have the configmaps.

@prazumovsky

I already removed these configmaps, unfortunately.
