Default startupProbe isn't long enough to allow for OSD repairs #10196
Comments
...nor for compactions, deep-scrubs, slow drives, etc. I recommend disabling all probes, since they become useless over time:
Given that Rook can't predict when the OSD will need to run those long operations at startup (so it can't disable the probes automatically), and given that it's painful to disable the probes manually once this issue is hit, it's a fair question whether probes should just be disabled by default for OSDs.
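For reference, the Rook CephCluster CRD exposes per-daemon probe toggles under `healthCheck` (see the health-settings doc linked later in this thread). A sketch of what disabling both probes for OSDs might look like (field layout based on the v1.9 docs; verify against your Rook version):

```yaml
# Fragment of a CephCluster spec: disable both probes for OSD pods.
# The exact schema may differ between Rook releases.
spec:
  healthCheck:
    startupProbe:
      osd:
        disabled: true
    livenessProbe:
      osd:
        disabled: true
```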
It looks reasonable.
The startup probe is probably fine to disable; however, I'd keep the liveness probe in case the OSD is still running (or hung) but not responding properly.
The question is really how long it's reasonable to allow an OSD to start up before it's considered stuck and in need of a restart. If we keep the liveness probe, shall we set the initial delay to 2 hours, or longer? How long might repair operations take before the OSD can respond to the liveness probe? @neha-ojha?
The issue described in #10196 (comment) is related to a new feature in Quincy (ceph/ceph#39871). The OSD now rebuilds allocation information from the store during ungraceful shutdown in order to achieve better write performance. This will lead to additional startup time, but I am curious to know how much time it took in this case and what is the current value of startupThreshold. |
Not claiming that this is statistically significant, but after disabling my liveness and startup checks I did have one OSD go weird on me today where it was down but didn't crash... so it at least can happen =] granted, my cluster is in a very foobarred state right now, so there could be many factors at play. |
I don't think restarting it every 5 minutes would have made it any better. |
The default startup probe currently only gives 90s for the OSD to start and respond on the admin port before it assumes the OSD needs to be restarted. |
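The 90s figure follows from standard Kubernetes probe arithmetic: the worst-case startup window is roughly `initialDelaySeconds + failureThreshold × periodSeconds`. A minimal sketch of that calculation (the specific threshold values below are assumptions chosen to reproduce the 90s default mentioned above, not confirmed Rook defaults):

```python
def startup_window(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Worst-case seconds before the kubelet gives up on a startupProbe
    and restarts the container."""
    return initial_delay_s + failure_threshold * period_s

# Assumed values that would yield the 90s window described above.
print(startup_window(initial_delay_s=0, period_s=10, failure_threshold=9))  # 90
```

By the same arithmetic, reaching a 90-minute window means raising `failureThreshold` (or `periodSeconds`) substantially, e.g. `failureThreshold=540` at a 10s period.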
In order to keep the liveness probe at a reasonable timeout, wouldn't we need to know when the OSD is actually started? The startup and liveness probes are the same, so once the OSD process is running, the liveness probe will start unless we have the readiness probe. Instead of disabling the readiness probe, I wonder if it would be better to set the liveness probe to a very large value. Would 90 minutes be sufficient? Too much? |
Or similarly, what if we keep both probes, but set the startupProbe to wait for 90 or even 120 minutes? |
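If we went that route, the override might look something like this in the CephCluster CRD (the values are illustrative, not proposed defaults, and the schema should be checked against the health-settings doc for the Rook version in use):

```yaml
# Fragment of a CephCluster spec: keep the OSD startup probe but
# extend its window to roughly 120 minutes (720 failures x 10s period).
spec:
  healthCheck:
    startupProbe:
      osd:
        probe:
          periodSeconds: 10
          failureThreshold: 720
```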
Is this a bug report or feature request?
Deviation from expected behavior:
When an OSD needs to be repaired, it should repair itself except in particularly exceptional circumstances. In my case, after some server issues that forced me to restart several servers, I had 8 different OSDs that needed repair but kept getting killed by the startupProbe timeout before they could finish.
How to reproduce it (minimal and precise):
I'm not sure how best to cause this, but it seems to happen most easily on my magnetic drives (the one I'm testing with is a 6TB RAID 0 array).
**Helpful logs**
debug 2022-05-02T17:23:15.747+0000 7ff4e46763c0 0 bluestore(/var/lib/ceph/osd/ceph-0) _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while) ...
**Recommendations**
I would recommend first that the startupThreshold be increased -- at least doubled, maybe more -- to give the OSD time for a repair. Additionally, documenting how to override those thresholds in the cluster YAML would be very helpful =]
@travisn did point me to https://rook.github.io/docs/rook/v1.9/ceph-cluster-crd.html#health-settings but maybe you could put a link to that in the template?
Thanks as always for the support from your devs, and to Travis, who has now helped me with a few issues =]