Operator default of rook/ceph:master is a bad idea #13843
I went to v1.12.11 and v17.2.7 and have no issues. I can kill the pods and they recover, as well as …

Also, the issue I have in #13842 seems to be ok in 1.12.11; I'm able to pull metrics from …

So I believe the issue is that my rook operator was using `rook/ceph:master`. I experienced this issue yet again in this helm chart and ceph version.
I believe the issue is:

```yaml
image:
  # -- Image
  repository: rook/ceph
  # -- Image tag
  # @default -- `master`
  tag: master
  # -- Image pull policy
  pullPolicy: IfNotPresent
```

The image is probably buggy. The image I was getting: https://hub.docker.com/layers/rook/ceph/master/images/sha256-3fb13dd7c2d7e7fa562c31365098ff91f55c543678dd1d7cdd9975f68e4f9a3d?context=explore

This is a pretty bad default though, correct? If someone's operator crashes and gets scheduled on a different node, they'll likely end up downloading this new broken image?
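To confirm which `master` build the operator is actually running (a hedged aside; the label selector assumes the chart's default `app=rook-ceph-operator` label), the image reference and resolved digest can be read off the pod:

```bash
# Print the image reference and the resolved digest for the operator pod(s).
kubectl -n rook-ceph get pods -l app=rook-ceph-operator \
  -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
```

The `imageID` field shows the sha256 digest that was actually pulled, which is what lets you match a node's image against a specific Docker Hub build like the one linked above.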
The latest rook release is v1.13.5.
@shell-skrimp It seems there are two core issues you are asking about:

1. The mon quorum is required for the operator to reconcile. Otherwise, the operator is blocked and the ceph cluster is blocked, as you found. Why are you restarting the mon pods? The operator is expected to update or restart the mons only when necessary. If all mons do restart at the same time, quorum is expected to be restored when at least 2/3 of them start up again. If quorum is not restored (e.g., if a majority of mons are lost), then you need to restore the quorum; see the disaster recovery guide.
2. The published helm charts actually use the release tag. The …
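As an aside, a quick way to check whether the mons currently have quorum (a sketch assuming the standard `rook-ceph-tools` toolbox deployment is running in the `rook-ceph` namespace):

```bash
# Overall cluster health, including how many mons are in quorum.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

# Detailed view of which mons are in quorum right now.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph quorum_status --format json-pretty
```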
Hey @travisn, I'll give you a synopsis of what I did to run into this issue. I was able to reproduce it twice. In the first instance I do not know what caused it, but I now believe it was after I updated the operator chart; it was never able to reconcile. The reason I believe this is that once I did a clean install, the same step triggered the issue on the new cluster. My initial setup: …

On the initial setup everything installed and was working fine. Then the next day I noticed that one of the mons was not coming back, and I decided to just clean install and use helm 1.12.11 and ceph 17.2.7(?). The cluster installed just fine, and it copied over 500GB just fine. Pods using the cephfs were fine (like in the other cluster). I then decided to randomly kill pods and deployments for mon/mds, and the operator was able to recover them just fine (a cluster outage of course, but this is fine, it's not live, yet ;)).

Then I decided to tweak the operator values.yaml (enabling monitoring) and the operator never recovered. Then I killed the mon/mds/mgr and it never recovered (again, testing the worst case here, because what if there's a power outage?). From here I went through the various bug reports and solutions on GitHub, disabling upgrade checks, force-allowing, etc. Nothing worked. I decided to downgrade the operator pod from `master` …

To answer your question RE the operator helm chart, the default for the image `tag` is `master` …
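For reference, the monitoring tweak described here plausibly corresponds to the chart's `monitoring.enabled` value (an assumption; the exact values change isn't shown in the thread). A change along these lines would be the kind of upgrade that kicked off the failed reconcile:

```bash
# Hypothetical example of the values change that preceded the failure.
helm upgrade --namespace rook-ceph rook-ceph rook-release/rook-ceph \
  --reuse-values --set monitoring.enabled=true
```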
How exactly did you install the helm chart? Did you follow these helm instructions?
I just installed with these commands and see the latest version (v1.13.5) running. The published charts don't use the master tag, so please check the helm install.
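The referenced install instructions boil down to roughly the following (a sketch of the documented flow; the `rook-release` repo name comes from the Rook docs):

```bash
# Add the Rook release chart repo and install the operator chart.
helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph

# Verify which operator image the deployment is actually running.
kubectl -n rook-ceph get deploy rook-ceph-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```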
Yep, I followed it exactly. If you kill some mons/mgr/mds and/or try to update the operator config, you should be able to reproduce the issue.

It also looks like the master image has changed since then, so it's possibly fixed in some future release. I haven't had issues since I locked to a non-master release.
Killing pods wouldn't cause the image to be updated to a newer `master` build. The master tag changes every time a PR is merged to master, so it is commonly updated several times a day.
Yea, I think this is the problem.

Your local rook-ceph-values.yaml must have the `image.tag` set to `master` …
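One way to check whether such an override is in effect (a hedged suggestion, not from the thread) is to compare the user-supplied release values against the chart's packaged defaults:

```bash
# Values the user supplied at install/upgrade time (overrides only).
helm -n rook-ceph get values rook-ceph

# The chart's packaged defaults for the image section.
helm show values rook-release/rook-ceph | grep -A 6 '^image:'
```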
Yea, but what I'm saying is that: …
Don't put a version in your values.yaml. The published chart has the correct version. By default you'll get the latest published version (currently v1.13.5). If you want a release version other than the latest, you can use the helm `--version` flag.
https://github.com/rook/rook/blob/v1.13.5/deploy/charts/rook-ceph/values.yaml#L10
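Pinning a chart version with that flag looks roughly like this (a sketch, again assuming the `rook-release` repo name):

```bash
# Install or upgrade to a specific published chart version rather than the latest.
helm upgrade --install --namespace rook-ceph rook-ceph \
  rook-release/rook-ceph --version v1.13.5
```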
So it looks like you're saying that the defaults are set in the chart repo already. I pulled: …
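For anyone retracing this, one way to inspect the packaged chart locally (a hedged sketch; these aren't necessarily the commands from the lost snippet):

```bash
# Download and unpack the published chart, then check its default image tag.
helm pull rook-release/rook-ceph --version v1.13.5 --untar
grep -A 6 '^image:' rook-ceph/values.yaml
```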
Our release process updates the values inside the chart with the release version. I understand the confusion though, so we will look at how to address this.

This is fixed now in v1.13.6 with #13897.
Similar to #12944

Helm: v1.13.5
Version: 18.2.1-0

If you `k rollout restart ceph-mon...`, the `rook-ceph-operator` never reconciles itself, even if you set `skipUpgradeChecks: true`. This is worrisome because what if these nodes lost power suddenly?

I noticed that mon-a was not visible, then I tried to manually start `rook-ceph-mon-a` with `k rollout ...`. I decided to go one step further and restart the rest of the mons, and it looks like I'm at the mercy of the operator working correctly?
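For readers hitting the same wall: `skipUpgradeChecks` is a field on the CephCluster CR, so one way to toggle it (a minimal sketch assuming the default cluster name `rook-ceph` in the `rook-ceph` namespace) is:

```bash
# Let the operator continue reconciling even when daemons are unhealthy.
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"skipUpgradeChecks":true}}'

# Watch the operator log to confirm the reconcile proceeds.
kubectl -n rook-ceph logs deploy/rook-ceph-operator -f
```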