Operator default of rook/ceph:master is bad idea #13843

Closed

shell-skrimp opened this issue Mar 1, 2024 · 19 comments

@shell-skrimp commented Mar 1, 2024

Similar to #12944

Rook Helm chart: v1.13.5
Ceph version: 18.2.1-0

If you run k rollout restart on the ceph-mon... deployments, the rook-ceph-operator never reconciles, even if you set skipUpgradeChecks: true.

This is worrisome: what happens if these nodes lose power suddenly?

I noticed that mon-a was not visible, so I tried to manually restart rook-ceph-mon-a with k rollout .... I then went one step further and restarted the rest of the mons, and now it looks like I'm at the mercy of the operator working correctly.
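
For context, the restarts were roughly along these lines (a sketch; k is a kubectl alias, and the mon deployment names are the usual rook-ceph-mon-a/b/c on my cluster):

# restart each mon deployment in the rook-ceph namespace
kubectl -n rook-ceph rollout restart deployment/rook-ceph-mon-a
kubectl -n rook-ceph rollout restart deployment/rook-ceph-mon-b
kubectl -n rook-ceph rollout restart deployment/rook-ceph-mon-c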

2024-03-01 00:02:56.473146 I | ceph-spec: ceph-file-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-03-01T00:01:42Z LastChanged:2024-02-29T23:06:12Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:48009345761280 UsedBytes:1680641310720 AvailableBytes:46328704450560 LastUpdated:2024-02-29T23:31:48Z} Versions:<nil> FSID:}
2024-03-01 00:02:56.473192 I | ceph-spec: ceph-fs-subvolumegroup-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-03-01T00:01:42Z LastChanged:2024-02-29T23:06:12Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:48009345761280 UsedBytes:1680641310720 AvailableBytes:46328704450560 LastUpdated:2024-02-29T23:31:48Z} Versions:<nil> FSID:}
2024-03-01 00:03:06.473699 I | ceph-spec: ceph-file-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-03-01T00:01:42Z LastChanged:2024-02-29T23:06:12Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:48009345761280 UsedBytes:1680641310720 AvailableBytes:46328704450560 LastUpdated:2024-02-29T23:31:48Z} Versions:<nil> FSID:}
2024-03-01 00:03:06.473746 I | ceph-spec: ceph-fs-subvolumegroup-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-03-01T00:01:42Z LastChanged:2024-02-29T23:06:12Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:48009345761280 UsedBytes:1680641310720 AvailableBytes:46328704450560 LastUpdated:2024-02-29T23:31:48Z} Versions:<nil> FSID:}
2024-03-01 00:03:12.761208 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2024-03-01 00:03:13.300621 I | op-mon: mon a is not yet running
2024-03-01 00:03:13.321545 I | op-mon: mon b is not yet running
2024-03-01 00:03:13.418760 I | op-mon: mon c is not yet running
2024-03-01 00:03:13.418770 I | op-mon: mons running: []
@shell-skrimp (Author)

I went back to v1.12.11 and Ceph v17.2.7 and have no issues. I can kill the pods and they recover, and k rollout ... works as well. Unsure what the issue is with 1.13.5.

@shell-skrimp (Author) commented Mar 1, 2024

Also, the issue I have in #13842 does not seem to occur in 1.12.11; I'm able to pull metrics from rook-ceph-mgr:9283/metrics without having the full Prometheus stack installed.
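
For reference, this is roughly how I'm checking it (a sketch, assuming the default rook-ceph namespace and the rook-ceph-mgr service exposing the metrics port):

# forward the mgr metrics port locally and scrape it
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr 9283:9283 &
curl -s http://localhost:9283/metrics | head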

@shell-skrimp (Author)

So I believe the issue is that my rook operator was using rook/ceph:master. I switched it to rook/ceph:v1.12.11-18.g582c940e8.

I experienced this issue yet again in this helm chart and ceph version.
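
(For anyone else hitting this: pinning the operator image can be done through the chart's image.tag value. A sketch, using v1.12.11 as an example tag:)

helm upgrade -n rook-ceph rook-ceph rook-release/rook-ceph \
  --reuse-values --set image.tag=v1.12.11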

@shell-skrimp (Author)

I believe the issue is

image:
  # -- Image
  repository: rook/ceph
  # -- Image tag
  # @default -- `master`
  tag: master
  # -- Image pull policy
  pullPolicy: IfNotPresent

The image is probably buggy. The image I was getting: https://hub.docker.com/layers/rook/ceph/master/images/sha256-3fb13dd7c2d7e7fa562c31365098ff91f55c543678dd1d7cdd9975f68e4f9a3d?context=explore

This is a pretty bad default though, correct? If someone's operator crashes and gets scheduled onto a different node, they'll likely end up pulling whatever master image is current, which may be broken.

shell-skrimp changed the title from "Reconcile loop on k rollout restart v1.13.5" to "Operator default of rook/ceph:master is bad idea" on Mar 1, 2024
@rkachach (Contributor) commented Mar 1, 2024

> So I believe the issue is that my rook operator was using rook/ceph:master. I switched it to rook/ceph:v1.12.11-18.g582c940e8.
>
> I experienced this issue yet again in this helm chart and ceph version.

The latest Rook release is 1.13.5, which comes with a lot of fixes. I'd recommend upgrading your setup and seeing whether you are still experiencing issues.

@travisn (Member) commented Mar 1, 2024

@shell-skrimp It seems you are asking about two core issues:

  1. You are restarting all the mons at the same time and finding they don't come back up?

The mon quorum is required for the operator to reconcile. Otherwise, the operator is blocked and the Ceph cluster is blocked, as you found. Why are you restarting the mon pods? The operator is expected to update or restart the mons only when necessary. If all mons do restart at the same time, the quorum is expected to be restored once at least 2/3 of them start up again. If the quorum is not restored (e.g. if a majority of mons are lost), then you need to restore the quorum and follow the disaster recovery guide. (A quick quorum check is sketched at the end of this comment.)

  2. You are getting the master tag from the helm chart?

The published helm charts actually use the release tag. The master tag is only an artifact of the GitHub repo. Are you installing the published charts according to the Rook helm docs?
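
A couple of quick checks can confirm both points (a sketch, assuming the default rook-ceph namespace and that the rook-ceph-tools toolbox is deployed):

# 1. check whether the mons currently have quorum (run via the toolbox)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph quorum_status --format json-pretty

# 2. check which image the operator deployment is actually running
kubectl -n rook-ceph get deploy rook-ceph-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'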

@shell-skrimp (Author) commented Mar 1, 2024

Hey @travisn,

I'll give you a synopsis of what I did to run into this issue. I was able to reproduce it twice. In the first instance I did not know what caused it, but I now believe it was after I updated the operator chart that it was never able to reconcile. The reason I believe this is that, once I did a clean install, updating the operator chart is exactly what triggered the issue on the new cluster.

My initial setup:

  • Rook Helm chart v1.13.5 with Ceph 18.2.0(?)
  • 8 bare-metal nodes with ~30 OSDs
  • Mixed ARM/x86 cluster

On the initial setup everything installed and was working fine. The next day I noticed that one of the mons was not coming back, and ceph -s indicated that it was out, etc. I did not think to check the rook operator logs at the time because I am a noob with rook :). I then decided to kill all the mon and mds pods because I expected them to recover (what if there was a power outage?). None of them came back, and then I noticed that the operator was just looping over the mons.

I then decided to do a clean install with Helm chart 1.12.11 and Ceph 17.2.7(?). The cluster installed just fine and copied over 500GB just fine. Pods using the CephFS were also fine (as in the other cluster). I then randomly killed pods and deployments for mon/mds, and the operator was able to recover them just fine (a cluster outage of course, but this is fine, it's not live, yet ;)).

Then I tweaked the operator values.yaml (enabling monitoring) and the operator never recovered. Then I killed the mon/mds/mgr pods and it never recovered (again, testing the worst case here, because what if there was a power outage?). From there I went through the various bug reports and solutions on GitHub, disabling upgrade checks, force-allowing, etc. Nothing worked.

I then downgraded the operator pod from rook/ceph:master to v1.12.11-18.g582c940e8. Once I did this, the operator reconciled just fine within a couple of minutes, even with all the mon/mds/mgr killed manually. I believe that the current rook/ceph:master (see the hash above) is bugged in some way that prevents reconciliation.

To answer your question re: the operator helm chart, the default for image.tag is master, which I believe is what caused the issue here.

@travisn (Member) commented Mar 4, 2024

How exactly did you install the helm chart? Did you follow these helm instructions?

helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph -f values.yaml

I just installed with these commands and see the latest version (v1.13.5) running. The published charts don't use the master tag so please check the helm install.
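
To double-check on your side, something like this should show the chart and app version of your release, plus any values you overrode (a sketch, assuming the release name and namespace from the commands above):

helm list -n rook-ceph
helm get values -n rook-ceph rook-ceph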

@shell-skrimp (Author)

Yep, I followed it exactly. If you kill some mons/mgr/mds and/or try to update the operator config, you should be able to reproduce the issue.

@shell-skrimp (Author)

It also looks like the master image has changed since then, so it's possibly fixed in some future release. I haven't had issues since I pinned to a non-master release.

@travisn (Member) commented Mar 4, 2024

> Yep, I followed it exactly. If you kill some mons/mgr/mds and/or try to update the operator config, you should be able to reproduce the issue.

Killing pods wouldn't cause the image to be updated to master. How exactly did you "update the operator config"?

> It also looks like the master image has changed since then, so it's possibly fixed in some future release. I haven't had issues since I pinned to a non-master release.

The master tag changes every time a PR is merged to master, so it commonly is updated several times a day.

@shell-skrimp (Author) commented Mar 4, 2024

> How exactly did you "update the operator config"?

helm upgrade -n rook-ceph -f ../rook-ceph-values.yaml rook-ceph rook-release/rook-ceph

> The master tag changes every time a PR is merged to master, so it commonly is updated several times a day.

Yea, I think this is the problem

@travisn (Member) commented Mar 4, 2024

Your local rook-ceph-values.yaml must have tag: master set. You'll need to remove that so the chart will use the release version. Your values.yaml is only expected to contain the values you want to modify from the released chart.
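
One way to spot leftover overrides like that (a sketch, assuming your local override file is the rook-ceph-values.yaml from your upgrade command):

# compare your local overrides against the published chart defaults
helm show values rook-release/rook-ceph > /tmp/rook-ceph-defaults.yaml
diff /tmp/rook-ceph-defaults.yaml rook-ceph-values.yaml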

@shell-skrimp (Author)

Yea, but what I'm saying is that:

  1. What's a good version to put there? There are a lot of different versions when you check Docker Hub, and no version is recommended.
  2. The default of master may lead to a bad user experience: they build a cluster, then it crashes on an operator update, and they don't really know how to resolve it until they start changing operator versions.

@travisn (Member) commented Mar 4, 2024

Don't put a version in your values.yaml. The published chart has the correct version. By default you'll get the latest published version (currently v1.13.5). If you want a release version other than the latest, you can use the helm --version flag. This would show the released versions: helm search repo --versions
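
For example (a sketch; the exact version string is whatever release you pick from the search output):

# list the published chart versions
helm search repo rook-release/rook-ceph --versions

# install a specific released version instead of the latest
helm install --create-namespace --namespace rook-ceph rook-ceph \
  rook-release/rook-ceph --version v1.13.5 -f values.yaml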

@shell-skrimp (Author)

> Don't put a version in your values.yaml. The published chart has the correct version. By default you'll get the latest published version (currently v1.13.5). If you want a release version other than the latest, you can use the helm --version flag. This would show the released versions: helm search repo --versions

https://github.com/rook/rook/blob/v1.13.5/deploy/charts/rook-ceph/values.yaml#L10

@shell-skrimp (Author)

So it looks like you're saying that the defaults are already set in the chart repo. I pulled helm show values rook-release/rook-ceph --version v1.12.11, which shows the tag as v1.12.11. Normally what I do is just fork the values.yaml file from the project and commit that into version control, but I see where the difference is now.

@travisn (Member) commented Mar 4, 2024

Our release process updates the values inside the chart with the release version. I understand the confusion though, so we will look at how to address this.

@travisn (Member) commented Mar 7, 2024

This is fixed now in v1.13.6 with #13897.
