Operator default of rook/ceph:master is bad idea #13843

Closed

shell-skrimp opened this issue Mar 1, 2024 · 19 comments

@shell-skrimp commented Mar 1, 2024

Similar to #12944

Rook Helm chart: v1.13.5
Ceph version: 18.2.1-0

If you run k rollout restart on the ceph-mon... deployments, the rook-ceph-operator never reconciles, even if you set skipUpgradeChecks: true.

This is worrisome: what happens if these nodes lose power suddenly?

I noticed that mon-a was not visible, so I tried to manually restart rook-ceph-mon-a with k rollout .... I then went one step further and restarted the rest of the mons, and now it looks like I'm at the mercy of the operator working correctly.
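
For context, the restarts were roughly along these lines (a sketch; k is a kubectl alias, and the mon deployment names are the usual rook-ceph-mon-a/b/c on my cluster):

# restart each mon deployment in the rook-ceph namespace
kubectl -n rook-ceph rollout restart deployment/rook-ceph-mon-a
kubectl -n rook-ceph rollout restart deployment/rook-ceph-mon-b
kubectl -n rook-ceph rollout restart deployment/rook-ceph-mon-c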

2024-03-01 00:02:56.473146 I | ceph-spec: ceph-file-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-03-01T00:01:42Z LastChanged:2024-02-29T23:06:12Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:48009345761280 UsedBytes:1680641310720 AvailableBytes:46328704450560 LastUpdated:2024-02-29T23:31:48Z} Versions:<nil> FSID:}
2024-03-01 00:02:56.473192 I | ceph-spec: ceph-fs-subvolumegroup-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-03-01T00:01:42Z LastChanged:2024-02-29T23:06:12Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:48009345761280 UsedBytes:1680641310720 AvailableBytes:46328704450560 LastUpdated:2024-02-29T23:31:48Z} Versions:<nil> FSID:}
2024-03-01 00:03:06.473699 I | ceph-spec: ceph-file-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-03-01T00:01:42Z LastChanged:2024-02-29T23:06:12Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:48009345761280 UsedBytes:1680641310720 AvailableBytes:46328704450560 LastUpdated:2024-02-29T23:31:48Z} Versions:<nil> FSID:}
2024-03-01 00:03:06.473746 I | ceph-spec: ceph-fs-subvolumegroup-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-03-01T00:01:42Z LastChanged:2024-02-29T23:06:12Z PreviousHealth:HEALTH_WARN Capacity:{TotalBytes:48009345761280 UsedBytes:1680641310720 AvailableBytes:46328704450560 LastUpdated:2024-02-29T23:31:48Z} Versions:<nil> FSID:}
2024-03-01 00:03:12.761208 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2024-03-01 00:03:13.300621 I | op-mon: mon a is not yet running
2024-03-01 00:03:13.321545 I | op-mon: mon b is not yet running
2024-03-01 00:03:13.418760 I | op-mon: mon c is not yet running
2024-03-01 00:03:13.418770 I | op-mon: mons running: []
@shell-skrimp (Author)

I went back to v1.12.11 and Ceph v17.2.7 and have no issues. I can kill the pods and they recover, and k rollout ... works as well. Unsure what the issue is with 1.13.5.

@shell-skrimp (Author) commented Mar 1, 2024

Also, the issue I have in #13842 does not seem to occur in 1.12.11; I'm able to pull metrics from rook-ceph-mgr:9283/metrics without having the full Prometheus stack installed.
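
For reference, this is roughly how I'm checking it (a sketch, assuming the default rook-ceph namespace and the rook-ceph-mgr service exposing the metrics port):

# forward the mgr metrics port locally and scrape it
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr 9283:9283 &
curl -s http://localhost:9283/metrics | head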

@shell-skrimp (Author)

So I believe the issue is that my rook operator was using rook/ceph:master. I switched it to rook/ceph:v1.12.11-18.g582c940e8.

I experienced this issue yet again in this helm chart and ceph version.
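
(For anyone else hitting this: pinning the operator image can be done through the chart's image.tag value. A sketch, using v1.12.11 as an example tag:)

helm upgrade -n rook-ceph rook-ceph rook-release/rook-ceph \
  --reuse-values --set image.tag=v1.12.11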

@shell-skrimp (Author)

I believe the issue is

image:
  # -- Image
  repository: rook/ceph
  # -- Image tag
  # @default -- `master`
  tag: master
  # -- Image pull policy
  pullPolicy: IfNotPresent

The image is probably buggy. The image I was getting: https://hub.docker.com/layers/rook/ceph/master/images/sha256-3fb13dd7c2d7e7fa562c31365098ff91f55c543678dd1d7cdd9975f68e4f9a3d?context=explore

This is a pretty bad default though, correct? If someone's operator crashes and gets scheduled onto a different node, they'll likely end up pulling whatever master image is current, which may be broken.

shell-skrimp changed the title from "Reconcile loop on k rollout restart v1.13.5" to "Operator default of rook/ceph:master is bad idea" on Mar 1, 2024
@rkachach (Contributor) commented Mar 1, 2024

> So I believe the issue is that my rook operator was using rook/ceph:master. I switched it to rook/ceph:v1.12.11-18.g582c940e8.
>
> I experienced this issue yet again in this helm chart and ceph version.

The latest Rook release is 1.13.5, which comes with a lot of fixes. I'd recommend upgrading your setup and seeing whether you are still experiencing issues.

@travisn (Member) commented Mar 1, 2024

@shell-skrimp It seems you are asking about two core issues:

  1. You are restarting all the mons at the same time and finding they don't come back up?

The mon quorum is required for the operator to reconcile. Otherwise, the operator is blocked and the Ceph cluster is blocked, as you found. Why are you restarting the mon pods? The operator is expected to update or restart the mons only when necessary. If all mons do restart at the same time, the quorum is expected to be restored once at least 2/3 of them start up again. If the quorum is not restored (e.g. if a majority of mons are lost), then you need to restore the quorum and follow the disaster recovery guide. (A quick quorum check is sketched at the end of this comment.)

  2. You are getting the master tag from the helm chart?

The published helm charts actually use the release tag. The master tag is only an artifact of the GitHub repo. Are you installing the published charts according to the Rook helm docs?
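
A couple of quick checks can confirm both points (a sketch, assuming the default rook-ceph namespace and that the rook-ceph-tools toolbox is deployed):

# 1. check whether the mons currently have quorum (run via the toolbox)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph quorum_status --format json-pretty

# 2. check which image the operator deployment is actually running
kubectl -n rook-ceph get deploy rook-ceph-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'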

@shell-skrimp (Author) commented Mar 1, 2024

Hey @travisn,

I'll give you a synopsis of what I did to run into this issue. I was able to reproduce it twice. In the first instance I did not know what caused it, but I now believe it was after I updated the operator chart that it was never able to reconcile. The reason I believe this is that, once I did a clean install, updating the operator chart is exactly what triggered the issue on the new cluster.

My initial setup:

  • Rook Helm chart v1.13.5 with Ceph 18.2.0(?)
  • 8 bare-metal nodes with ~30 OSDs
  • Mixed ARM/x86 cluster

On the initial setup everything installed and was working fine. The next day I noticed that one of the mons was not coming back, and ceph -s indicated that it was out, etc. I did not think to check the rook operator logs at the time because I am a noob with rook :). I then decided to kill all the mon and mds pods because I expected them to recover (what if there was a power outage?). None of them came back, and then I noticed that the operator was just looping over the mons.

I then decided to do a clean install with Helm chart 1.12.11 and Ceph 17.2.7(?). The cluster installed just fine and copied over 500GB just fine. Pods using the CephFS were also fine (as in the other cluster). I then randomly killed pods and deployments for mon/mds, and the operator was able to recover them just fine (a cluster outage of course, but this is fine, it's not live, yet ;)).

Then I tweaked the operator values.yaml (enabling monitoring) and the operator never recovered. Then I killed the mon/mds/mgr pods and it never recovered (again, testing the worst case here, because what if there was a power outage?). From there I went through the various bug reports and solutions on GitHub, disabling upgrade checks, force-allowing, etc. Nothing worked.

I then downgraded the operator pod from rook/ceph:master to v1.12.11-18.g582c940e8. Once I did this, the operator reconciled just fine within a couple of minutes, even with all the mon/mds/mgr killed manually. I believe that the current rook/ceph:master (see the hash above) is bugged in some way that prevents reconciliation.

To answer your question re: the operator helm chart, the default for image.tag is master, which I believe is what caused the issue here.

@travisn (Member) commented Mar 4, 2024

How exactly did you install the helm chart? Did you follow these helm instructions?

helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph -f values.yaml

I just installed with these commands and see the latest version (v1.13.5) running. The published charts don't use the master tag so please check the helm install.
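
To double-check on your side, something like this should show the chart and app version of your release, plus any values you overrode (a sketch, assuming the release name and namespace from the commands above):

helm list -n rook-ceph
helm get values -n rook-ceph rook-ceph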

@shell-skrimp (Author)

Yep, I followed it exactly. If you kill some mons/mgr/mds and/or try to update the operator config, you should be able to reproduce the issue.

@shell-skrimp (Author)

It also looks like the master image has changed since then, so it's possibly fixed in some future release. I haven't had issues since I pinned to a non-master release.

@travisn (Member) commented Mar 4, 2024

> Yep, I followed it exactly. If you kill some mons/mgr/mds and/or try to update the operator config, you should be able to reproduce the issue.

Killing pods wouldn't cause the image to be updated to master. How exactly did you "update the operator config"?

> It also looks like the master image has changed since then, so it's possibly fixed in some future release. I haven't had issues since I pinned to a non-master release.

The master tag changes every time a PR is merged to master, so it commonly is updated several times a day.

@shell-skrimp (Author) commented Mar 4, 2024

> How exactly did you "update the operator config"?

helm upgrade -n rook-ceph -f ../rook-ceph-values.yaml rook-ceph rook-release/rook-ceph

> The master tag changes every time a PR is merged to master, so it commonly is updated several times a day.

Yea, I think this is the problem

@travisn (Member) commented Mar 4, 2024

Your local rook-ceph-values.yaml must have tag: master set. You'll need to remove that so the chart will use the release version. Your values.yaml is only expected to contain the values you want to modify from the released chart.
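
One way to spot leftover overrides like that (a sketch, assuming your local override file is the rook-ceph-values.yaml from your upgrade command):

# compare your local overrides against the published chart defaults
helm show values rook-release/rook-ceph > /tmp/rook-ceph-defaults.yaml
diff /tmp/rook-ceph-defaults.yaml rook-ceph-values.yaml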

@shell-skrimp (Author)

Yea, but what I'm saying is that:

  1. What's a good version to put there? There are a lot of different versions when you check Docker Hub, and no version is recommended.
  2. The default of master may lead to a bad user experience: they build a cluster, then it crashes on an operator update, and they don't really know how to resolve it until they start changing operator versions.

@travisn (Member) commented Mar 4, 2024

Don't put a version in your values.yaml. The published chart has the correct version. By default you'll get the latest published version (currently v1.13.5). If you want a release version other than the latest, you can use the helm --version flag. This would show the released versions: helm search repo --versions
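
For example (a sketch; the exact version string is whatever release you pick from the search output):

# list the published chart versions
helm search repo rook-release/rook-ceph --versions

# install a specific released version instead of the latest
helm install --create-namespace --namespace rook-ceph rook-ceph \
  rook-release/rook-ceph --version v1.13.5 -f values.yaml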

@shell-skrimp (Author)

> Don't put a version in your values.yaml. The published chart has the correct version. By default you'll get the latest published version (currently v1.13.5). If you want a release version other than the latest, you can use the helm --version flag. This would show the released versions: helm search repo --versions

https://github.com/rook/rook/blob/v1.13.5/deploy/charts/rook-ceph/values.yaml#L10

@shell-skrimp (Author)

So it looks like you're saying that the defaults are already set in the chart repo. I pulled helm show values rook-release/rook-ceph --version v1.12.11, which shows the tag as v1.12.11. Normally what I do is just fork the values.yaml file from the project and commit that into version control, but I see where the difference is now.

@travisn (Member) commented Mar 4, 2024

Our release process updates the values inside the chart with the release version. I understand the confusion though, so we will look at how to address this.

@travisn (Member) commented Mar 7, 2024

This is fixed now in v1.13.6 with #13897.
