rook operator fails to properly manage one osd-prepare-job #8558

Closed
raz3k opened this issue Aug 18, 2021 · 4 comments · Fixed by #9116

raz3k commented Aug 18, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
The operator keeps looping with the message:
op-osd: waiting... 2 of 3 OSD prepare jobs have finished processing and 3 of 3 OSDs have been updated

Expected behavior:
Should finish reconciliation.

How to reproduce it (minimal and precise):

It occurred when upgrading a v1.6.7-based chart to the same v1.6.7-based chart, which restarted the operator.
We tried restarting the operator and deleting the configmap rook-ceph-osd-nodename-3-status; what finally fixed it was manually deleting the prepare job for nodename-3 and restarting the operator, after which it reported "reconciliation complete".
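
In kubectl terms, the steps were roughly the following (a sketch; the configmap and job names and the kube-system namespace come from the outputs below, while the operator deployment name is inferred from the operator pod name and may differ in other setups):

# earlier attempt that did not help on its own: remove the per-node OSD status configmap
$ kubectl -n kube-system delete configmap rook-ceph-osd-nodename-3-status
# what finally fixed it: delete the stuck OSD prepare job, then restart the operator
$ kubectl -n kube-system delete job rook-ceph-osd-prepare-nodename-3
$ kubectl -n kube-system rollout restart deployment rook-ceph-operator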

File(s) to submit:

$ kubectl get cm rook-ceph-osd-nodename-3-status -n kube-system -o yaml --> https://pastebin.com/LbxsVTFJ
$ kubectl get cephclusters.ceph.rook.io -A -o yaml --> https://pastebin.com/evNz5EFT
$ kubectl get pods -n kube-system --> https://pastebin.com/An64qYhU
$ kubectl logs rook-ceph-operator-5d684ccc5c-jlngj -n kube-system --> https://pastebin.com/YHwzTspv
$ kubectl logs rook-ceph-osd-prepare-nodename-3-4tq6w -n kube-system --> https://pastebin.com/fnAfJWBa
$ kubectl get job -n kube-system rook-ceph-osd-prepare-nodename-3 -o yaml --> https://pastebin.com/4jsx6vwy

Environment:

  • OS (e.g. from /etc/os-release): CentOS 7.10
  • Kernel (e.g. uname -a): 3.10.0-1160.36.2.el7.x86_64
  • Cloud provider or hardware configuration: OpenStack
  • Rook version (use rook version inside of a Rook Pod): v1.6.7
  • Storage backend version (e.g. for ceph do ceph -v): 16.2.2 (e8f22dde28889481f4dda2beb8a07788204821d3)
  • Kubernetes version (use kubectl version): GitVersion:"v1.21.1"
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Vanilla kubeadm deployed 3 master+workers
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_WARN noout flag(s) set
raz3k added the bug label Aug 18, 2021

travisn commented Aug 18, 2021

Deleting the status configmap and osd prepare job was the right thing to do when the operator gets stuck. It has been a rare issue to track down, so do let us know if you see it again.

@prazumovsky

Found again on rook v1.6.8 and ceph 15.2.13.
rco-waiting-prepare-jobs.log
rook-ceph.log


travisn commented Oct 14, 2021

Thanks for the logs. Do you still have the OSD status configmaps in that state, or did you already work around it and delete them? I can probably just work off the logs you provided if you don't have the configmaps.

@prazumovsky

I already removed these configmaps, unfortunately.
