Ceph monitors in crash loop #10110
Mon logs with debug level set to 3
Note: the reason the pod is not crashing anymore is that I increased the livenessProbe timers.
@log1cb0mb What are your liveness probe values now to get it working?
@travisn It's not exactly working. The mon daemon is still crashing; it's just that, due to the increased probe timeout, the pod stays alive, and by the time the probe would fail, the mon daemon/service comes back up and goes through the same loop all over again.
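For reference, mon probe timings can be raised through the CephCluster resource. The sketch below is only an illustration: it assumes Rook's healthCheck.livenessProbe override is available in the release in use and that the cluster resource is named rook-ceph in the rook-ceph namespace; the numbers are arbitrary, not the reporter's actual values.

# Hypothetical example: relax the mon liveness probe via Rook's healthCheck override
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '
{"spec": {"healthCheck": {"livenessProbe": {"mon": {"probe": {
  "initialDelaySeconds": 60,
  "timeoutSeconds": 15,
  "periodSeconds": 30,
  "failureThreshold": 10
}}}}}}'

The same block can be set in cluster.yaml before deploying. Note that relaxing the probe only hides the symptom, since the mon itself keeps restarting.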
I went ahead and even dared to test with the latest Quincy release (v17.2.0): same behaviour. With Quincy there were other issues as well, e.g. OSD prepare failing, claiming there is a filesystem on the drive (ceph_bluestore), even though that filesystem was created by the OSD prepare job itself. I think during preparation the pod/job fails or dies unexpectedly, and when it comes back up it finds the bluestore filesystem, so it never goes further with spinning up OSD pods. I am starting to think it is something related to Kubernetes component versions. I even tried downgrading kube-proxy (only). On the cluster in question I have several other services running just fine, such as ingress controller, cert-manager, external-dns, etc.
I believe I have narrowed down the issue. It is starting to look like this is not a Ceph issue but something in the FCOS release updates. Managed to reproduce the issue in a virtual environment with the latest FCOS. However, the crash occurs at that very same command.
Will investigate more and verify package updates on the FCOS end.
@travisn Any suggestions regarding particular packages I should be looking at on the FCOS end that are required/used by Ceph?
Narrowed down the last working release. The issue occurs with releases after it.
Among other changes, one major upgrade is the kernel. Update: with
It looks like there are other instances of these failures. I suspect this is a kernel issue, since containers still use the host kernel. This is probably an issue that should be raised with the Ceph tracker, since I don't think Rook itself would be introducing it. What is the version of Ceph (if any) installed on FCOS?
I suspect it to be a kernel issue as well. There are no Ceph or related packages installed by default on FCOS.
We have a very similar issue here. The Ceph cluster only works for a few minutes, then the problem with the monitors starts. Environment:
Ceph status
Crashed monitor logs
Yes, it's the same behavior. I have not tried it again with newer Fedora kernel versions, but I expect the issue persists. I had tried to open a bug on the Ceph tracker, but it did not allow me to create an account for some reason.
I have a similar problem:
[root@m-0 certs.d]# kubectl get pods -n rook-ceph -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-rbdplugin-2qxpc 3/3 Running 0 47m 192.168.113.172 m-1 <none> <none>
csi-rbdplugin-provisioner-85644d444f-2hrrs 6/6 Running 0 47m 10.233.10.236 m-2 <none> <none>
csi-rbdplugin-provisioner-85644d444f-z6jvc 6/6 Running 1 (37m ago) 47m 10.233.57.134 m-0 <none> <none>
csi-rbdplugin-wn7fh 3/3 Running 0 47m 192.168.113.173 m-2 <none> <none>
csi-rbdplugin-zlf69 3/3 Running 0 47m 192.168.113.171 m-0 <none> <none>
rook-ceph-crashcollector-m-0-668848dcb4-qv6t9 1/1 Running 0 37m 10.233.57.151 m-0 <none> <none>
rook-ceph-crashcollector-m-1-658ff96b4c-s2wvq 1/1 Running 0 37m 10.233.183.10 m-1 <none> <none>
rook-ceph-crashcollector-m-2-8c986658-xvzd4 1/1 Running 0 38m 10.233.10.239 m-2 <none> <none>
rook-ceph-mgr-a-686ddcdfd-7zzf9 2/2 Running 0 39m 10.233.183.5 m-1 <none> <none>
rook-ceph-mgr-b-64754fbbc4-6zhf7 2/2 Running 0 39m 10.233.57.148 m-0 <none> <none>
rook-ceph-mon-a-57df58b479-lw68k 0/1 CrashLoopBackOff 11 (86s ago) 45m 10.233.10.238 m-2 <none> <none>
rook-ceph-mon-b-5cdb8ff9c7-sdqfr 0/1 CrashLoopBackOff 11 (65s ago) 41m 10.233.57.136 m-0 <none> <none>
rook-ceph-mon-c-7899d87468-m9sws 1/1 Running 0 41m 10.233.183.63 m-1 <none> <none>
rook-ceph-operator-79dfc7cbc7-q9zlv 1/1 Running 0 51m 10.233.57.138 m-0 <none> <none>
rook-ceph-osd-0-657cfdc84b-9dr8f 1/1 Running 0 37m 10.233.57.153 m-0 <none> <none>
rook-ceph-osd-1-5d5ccdf77d-qtx26 1/1 Running 0 37m 10.233.183.11 m-1 <none> <none>
rook-ceph-osd-2-6f7f8f8b97-72b75 1/1 Running 0 37m 10.233.10.241 m-2 <none> <none>
rook-ceph-osd-prepare-m-0--1-qzw4l 0/1 Completed 0 30m 10.233.57.156 m-0 <none> <none>
rook-ceph-osd-prepare-m-1--1-fx6hd 0/1 Completed 0 30m 10.233.183.17 m-1 <none> <none>
rook-ceph-osd-prepare-m-2--1-tl4r6 0/1 Completed 0 30m 10.233.10.243 m-2 <none> <none>

The normal monitor:
[root@m-0 certs.d]# kubectl exec -ti -n rook-ceph rook-ceph-mon-c-7899d87468-m9sws -- bash
Defaulted container "mon" out of: mon, chown-container-data-dir (init), init-mon-fs (init)
[root@rook-ceph-mon-c-7899d87468-m9sws ceph]# ps -ef
UID PID PPID C STIME TTY TIME CMD
ceph 1 0 0 11:54 ? 00:00:25 ceph-mon --fsid=eb5fbe58-926b-4fca-bf3e-1b3a9c35580e --keyring=/etc/ceph/keyring-store/keyring --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-
root 1961 0 0 12:39 pts/0 00:00:00 bash
root 1976 1961 0 12:39 pts/0 00:00:00 ps -ef
[root@rook-ceph-mon-c-7899d87468-m9sws ceph]# ceph --admin-daemon /run/ceph/ceph-mon.c.asok mon_status
{
"name": "c",
"rank": 2,
"state": "electing",
"election_epoch": 121,
"quorum": [],
"features": {
"required_con": "2449958747317026820",
"required_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging"
],
"quorum_con": "4540138297136906239",
"quorum_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging"
]
},
"outside_quorum": [],
"extra_probe_peers": [
{
"addrvec": [
{
"type": "v2",
"addr": "10.103.218.213:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.103.218.213:6789",
"nonce": 0
}
]
},
{
"addrvec": [
{
"type": "v2",
"addr": "10.106.1.230:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.106.1.230:6789",
"nonce": 0
}
]
}
],
"sync_provider": [],
"monmap": {
"epoch": 3,
"fsid": "eb5fbe58-926b-4fca-bf3e-1b3a9c35580e",
"modified": "2022-06-29T11:54:01.075916Z",
"created": "2022-06-29T11:51:30.882189Z",
"min_mon_release": 16,
"min_mon_release_name": "pacific",
"election_strategy": 1,
"disallowed_leaders: ": "",
"stretch_mode": false,
"tiebreaker_mon": "",
"features": {
"persistent": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "10.106.1.230:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.106.1.230:6789",
"nonce": 0
}
]
},
"addr": "10.106.1.230:6789/0",
"public_addr": "10.106.1.230:6789/0",
"priority": 0,
"weight": 0,
"crush_location": "{}"
},
{
"rank": 1,
"name": "b",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "10.103.218.213:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.103.218.213:6789",
"nonce": 0
}
]
},
"addr": "10.103.218.213:6789/0",
"public_addr": "10.103.218.213:6789/0",
"priority": 0,
"weight": 0,
"crush_location": "{}"
},
{
"rank": 2,
"name": "c",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "10.103.26.182:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.103.26.182:6789",
"nonce": 0
}
]
},
"addr": "10.103.26.182:6789/0",
"public_addr": "10.103.26.182:6789/0",
"priority": 0,
"weight": 0,
"crush_location": "{}"
}
]
},
"feature_map": {
"mon": [
{
"features": "0x3f01cfb9fffdffff",
"release": "luminous",
"num": 1
}
],
"osd": [
{
"features": "0x3f01cfb9fffdffff",
"release": "luminous",
"num": 3
}
],
"client": [
{
"features": "0x3f01cfb9fffdffff",
"release": "luminous",
"num": 3
}
],
"mgr": [
{
"features": "0x3f01cfb9fffdffff",
"release": "luminous",
"num": 2
}
]
},
"stretch_mode": false
}

The failed monitor (running but not ready):
[root@m-0 certs.d]# kubectl exec -ti -n rook-ceph rook-ceph-mon-b-5cdb8ff9c7-wdgmh -- bash
Defaulted container "mon" out of: mon, chown-container-data-dir (init), init-mon-fs (init)
[root@rook-ceph-mon-b-5cdb8ff9c7-wdgmh ceph]# ps -ef
UID PID PPID C STIME TTY TIME CMD
ceph 1 0 1 12:43 ? 00:00:00 ceph-mon --fsid=eb5fbe58-926b-4fca-bf3e-1b3a9c35580e --keyring=/etc/ceph/keyring-store/keyring --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-
root 54 0 0 12:43 pts/0 00:00:00 bash
ceph 70 1 10 12:43 ? 00:00:00 ceph-mon --fsid=eb5fbe58-926b-4fca-bf3e-1b3a9c35580e --keyring=/etc/ceph/keyring-store/keyring --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-
root 71 54 0 12:43 pts/0 00:00:00 ps -ef
[root@rook-ceph-mon-b-5cdb8ff9c7-wdgmh ceph]# ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
# command hangs
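One way to tell whether the unresponsive mon is busy rather than dead is to check its CPU usage from inside the pod. A minimal sketch, assuming (as in the ps output above) that PID 1 is the ceph-mon process:

# Show elapsed time and CPU of the mon process; run it a few times in a row.
# A daemon pinned near 100% CPU while its admin socket hangs suggests a busy loop
# rather than a clean crash.
ps -o pid,etime,pcpu,comm -p 1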
I can create a PVC when the mon replica count is set to 1 and probes are disabled.
@chenlein are all mons running on the same host?
@BlaineEXE all mons are running on different hosts when replicas is set to 3, and in this case the OSD pool cannot be created. When I change replicas to 1 and redeploy Rook, the pool and PVC can be created. However, during creation of the pool the mon's health check fails for a few minutes or so, so disabling the mon's probe is necessary.
I see similar behavior here to this old Ceph thread: https://tracker.ceph.com/issues/11313 The key takeaway from that thread is that when the underlying system has other processes running that compete for resources, it can slow the mons down too much and cause Ceph to be unstable. I suspect this could also happen if you are running mons on
@BlaineEXE Thanks. I'm testing on the same infrastructure; however, everything works fine on CentOS 7.9.1810 and Ubuntu 20.04.4. I've always suspected that some kernel parameter is causing it, but I can't be sure.
I'm experiencing the same issue as described in the first message, with Fedora CoreOS 36 virtual machines. All storage is SSD-based, and all mons are running on separate nodes. Latest Rook, the hypervisor is running Proxmox 7.2, and the cluster is on Kubernetes 1.24.2 with Calico as CNI.
@jqueuniet what kernel version is running on those FCOS nodes? Could you post the uname output?
I can't post a uname outright right now, as our lab is down for network changes, but the file output should be close enough.
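For anyone else collecting the same details, the usual commands on an FCOS node are below (the file referred to above is presumably something like /etc/os-release):

# Kernel version and build
uname -a
# OS release/stream information
cat /etc/os-release
rpm-ostree status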
And in case this is important too, the CRI used is containerd 1.6.4.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
Is there any update on this issue? I have a similar issue.
We are also encountering an issue with similar symptoms. We can't see any obvious issue in the logs, but we're happy to provide logs for any combination of containerd/kernel/other packages.
We also had that issue in our environment. When deploying the cluster, we worked around it by configuring the containerd version.
Environment:
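Given the containerd angle, one host-side check worth doing is to look at the file-descriptor limit the containerd unit passes down to containers. This is a sketch and assumes containerd runs as a systemd service named containerd.service; newer containerd packages ship LimitNOFILE=infinity, which later comments tie to the mon busy loop.

# Show the NOFILE limits configured on the containerd unit
systemctl show containerd --property=LimitNOFILE,LimitNOFILESoft
# "infinity" or a value around 2^30 (1073741824) is the suspicious configuration;
# roughly 1048576 or lower should not trigger the behaviour described later in the thread.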
@edef1c the PR looks to have failed some tests. Did you get a chance to look at it?
Looks like it was just the "No impact that needs to be tracked" checkbox; it's pending review and maintainer approval for further checks beyond that.
Got the same issue on AlmaLinux 9.2 with systemd-252-13.el9_2.src.rpm and kernel 6.1.28. After trying to create a pool, the monitor hits 100% CPU on ms_dispatch:
How to fix? Create this file:
Run on every server in the cluster:
After all of that:
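The exact file contents and commands did not survive above, so here is a sketch of the workaround commonly applied for this problem. It assumes the trigger is an oversized LimitNOFILE inherited from the container runtime's systemd unit; the paths, unit name, and the 1048576 value are illustrative, and on a cephadm deployment the drop-in would target the ceph mon container units instead of containerd.

# 1. Cap the file-descriptor limit the runtime passes to containers
mkdir -p /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=1048576
EOF

# 2. On every node in the cluster: reload systemd and restart the runtime
systemctl daemon-reload
systemctl restart containerd

# 3. Restart the affected mon pods so they pick up the new limit
kubectl -n rook-ceph delete pod -l app=rook-ceph-mon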
@gleb-shchavlev Are you using cephadm to create the cluster? This issue is related to mons created by Rook.
@travisn Yes, I created the cluster with cephadm.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
unstale
This works! (Oracle Linux 9.2, k8s 1.28.1, Rook Ceph 1.12.3)
Fix rook/rook#10110, which occurs when _SC_OPEN_MAX/RLIMIT_NOFILE is set to very large values (2^30), leaving fork_function pegging a core busylooping. The glibc wrappers closefrom(3)/close_range(3) are not available before glibc 2.34, so we invoke the syscall directly. When glibc 2.34 is old enough to be a reasonable hard minimum dependency, we should switch to using closefrom. If we're not running on (recent enough) Linux, we fall back to the existing approach. Fixes: https://tracker.ceph.com/issues/59125 Signed-off-by: edef <edef@edef.eu>
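A quick way to confirm whether a given container is in the pathological configuration this patch describes is to read _SC_OPEN_MAX directly. A sketch, assuming it is run inside the affected mon container:

# getconf reports _SC_OPEN_MAX, which tracks RLIMIT_NOFILE
getconf OPEN_MAX
# A value on the order of 2^30 (1073741824) means the pre-patch descriptor-closing
# loop has about a billion file descriptors to walk on every fork, pegging a core;
# around 1048576 or lower and the loop finishes almost immediately.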
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
unstale
This is a Ceph issue; we need to follow up on the Ceph PR to be merged and included in a release... ceph/ceph#50622
I'm really not sure what's needed at this point, and I'm not familiar enough with the social mechanics of the Ceph project to push it along effectively.
Hitting this problem as well.
I'm having the same issue; it looks like the PR that contains the fix is blocked: ceph/ceph#50622
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
unstale
Is this a bug report or feature request?
Deviation from expected behavior:
With a new deployment, Ceph mons keep crashing and the cluster is never created completely, or is never in a healthy state.
Expected behavior:
Rook operator and cluster deployment complete successfully.
How to reproduce it (minimal and precise):
It is difficult to say, as the same deployment method works successfully on a different Kubernetes cluster running a different Kubernetes release, while the hardware and host OS version are the same.
Deployment method:
File(s) to submit:
So right after the leader mon tries to run the osd pool create command, it crashes, and that is when the liveness probe kicks in, which results in the pod shutting down. This happens to every mon pod that becomes leader after election. If full mon logs are required, those can be provided, along with the operator logs. I would like to understand the process that occurs right after that command, especially from a network connectivity perspective, i.e. whether the mon needs to connect to a different pod/service which could be failing. Based on my investigation, I could not find any network connectivity issue between any of the pods or services.
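For anyone trying to reproduce the trigger, the crash is reported to follow the mon handling a pool creation. From the Rook toolbox that roughly corresponds to the commands below; this assumes the standard rook-ceph-tools deployment is installed, and the pool name and PG count are only illustrative.

# Create a pool and watch which mon crashes right afterwards
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool create test-pool 32
kubectl -n rook-ceph get pods -l app=rook-ceph-mon -w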
Environment: