Ceph MDS deployment not updated/created in case of MDS_ALL_DOWN #5846
Comments
@leseb How about having the upgrade look at the reported HEALTH_ERR codes and continue only if every code is on a list of codes known to be safe for the upgrade?
An alternative idea is that we shouldn't block the reconcile of MDS deployments on Ceph health at all. IMO the HEALTH_ERR check for upgrades only really makes sense for mons and OSDs.
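Purely to illustrate the allowlist idea from the comment above, a sketch of what such a knob could look like as an operator setting. The ROOK_UPGRADE_HEALTH_ERR_ALLOWLIST key is hypothetical and does not exist in Rook; only the rook-ceph-operator-config ConfigMap itself is real, and the listed codes are just examples of real Ceph health codes.

```yaml
# HYPOTHETICAL sketch only: the data key below is not an existing Rook setting.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: rook-ceph
data:
  # Health codes the upgrade check would tolerate; the upgrade would
  # continue only if every reported HEALTH_ERR code is on this list.
  ROOK_UPGRADE_HEALTH_ERR_ALLOWLIST: "MDS_ALL_DOWN,MDS_UP_LESS_THAN_MAX"
```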
The list of codes is interesting, but this makes me nervous at the same time...
What is the procedure for restarting the MDS in this case?
This should now be marked as fixed.
Yes, resolved with #6494.
This avoids the catch-22 situation where the filesystem cannot be reconciled because there is no MDS, but there is no MDS because the operator has not reconciled the filesystem and brought up the MDS pods.

Closes #5967, #5846
Signed-off-by: Lalit Maganti <lalitm@google.com>
(cherry picked from commit 88f16e4)
Is this a bug report or feature request?
Deviation from expected behavior:
When Ceph reports MDS_ALL_DOWN, the Rook Ceph operator does not create/update the MDS deployments.

I brought myself into this situation by setting resource limits that triggered OOMKilled for both MDS. After raising the limit in the cephfilesystem CRD, I noticed that the operator does not update the MDS deployment. While I could have adjusted the deployment manually, I just went for deleting the deployments, assuming the operator would restore them. Obviously, it does not.

(Side note: a 4 GiB limit causing OOMKilled for a nearly empty CephFS looks a bit strange to me. Depending on further analysis, I will file another issue for it.)
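For context, a minimal sketch of where those MDS resource limits live on the filesystem CRD. The filesystem name, namespace, and pool settings are placeholders; the 4Gi value mirrors the limit mentioned in the side note.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs            # placeholder name
  namespace: rook-ceph  # placeholder namespace
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - replicated:
        size: 3
  metadataServer:
    activeCount: 1
    activeStandby: true
    # Raising this limit is the change the operator did not roll out
    # to the MDS deployments while MDS_ALL_DOWN was reported.
    resources:
      limits:
        memory: "4Gi"
      requests:
        memory: "4Gi"
```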
Expected behavior:
The Rook Ceph operator should create/update the MDS deployments in case of MDS_ALL_DOWN. If that is too dangerous in general, an option similar to skipUpgradeChecks would help. It might also be worth considering reverting to the previous deployment if the pods do not start with the changed parameters.
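For reference, skipUpgradeChecks already exists on the CephCluster CRD; below is a minimal sketch of where it sits, with the cluster name, namespace, image, and mon settings as placeholders. An analogous escape hatch for the filesystem/MDS reconcile is what this issue is asking for; it does not exist in the reported version.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph       # placeholder
  namespace: rook-ceph  # placeholder
spec:
  cephVersion:
    image: ceph/ceph:v15.2.4
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  # Existing option: continue the upgrade even when Ceph health checks
  # would otherwise block it.
  skipUpgradeChecks: true
```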
How to reproduce it (minimal and precise):
Set resource limits in the cephfilesystem CRD that trigger OOMKilled for the MDS pods, or delete the MDS deployments.
Environment:
- Kernel (uname -a): Linux n0201 5.3.0-62-generic #56~18.04.1-Ubuntu SMP Wed Jun 24 16:17:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Rook version (rook version inside of a Rook Pod): rook: v1.3.8
- Storage backend version (ceph -v): ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
- Kubernetes version (kubectl version): 1.15.4
- Kubernetes cluster type: kubeadm
- Storage backend status (ceph health in the Rook Ceph toolbox): HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds daemon; 1 filesystem is offline; insufficient standby MDS daemons available