Default startupProbe isn't long enough to allow for OSD repairs #10196
Comments
...nor for compactions, deep-scrubs, slow drives, etc. I recommend disabling all probes, since they become useless over time:
Given that Rook can't predict when the OSD will need to run those long operations at startup (so it can't disable the probes automatically), and given that it's painful to disable the probes manually once this issue is hit, it's a fair question whether probes should just be disabled by default for OSDs.
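For reference, the Rook CephCluster CRD exposes per-daemon probe toggles under `healthCheck` (see the health-settings doc linked later in this thread). A sketch of what disabling both probes for OSDs might look like (field layout based on the v1.9 docs; verify against your Rook version):

```yaml
# Fragment of a CephCluster spec: disable both probes for OSD pods.
# The exact schema may differ between Rook releases.
spec:
  healthCheck:
    startupProbe:
      osd:
        disabled: true
    livenessProbe:
      osd:
        disabled: true
```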
It looks reasonable.
The startup probe is probably fine to disable; however, I'd keep the liveness probe in case the OSD is still running (or hung) but not responding properly.
The question is really how long it's reasonable to allow an OSD to start up before it's considered stuck and in need of a restart. If we keep the liveness probe, shall we set the initial delay to 2 hours, or longer? How long might repair operations take before the OSD can respond to the liveness probe? @neha-ojha?
The issue described in #10196 (comment) is related to a new feature in Quincy (ceph/ceph#39871). The OSD now rebuilds allocation information from the store during ungraceful shutdown in order to achieve better write performance. This will lead to additional startup time, but I am curious to know how much time it took in this case and what is the current value of startupThreshold. |
Not claiming that this is statistically significant, but after disabling my liveness and startup checks I did have one OSD go weird on me today where it was down but didn't crash... so it at least can happen =] granted, my cluster is in a very foobarred state right now, so there could be many factors at play. |
I don't think restarting it every 5 minutes would have made it any better. |
The default startup probe currently only gives 90s for the OSD to start and respond on the admin port before it assumes the OSD needs to be restarted. |
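The 90s figure follows from standard Kubernetes probe arithmetic: the worst-case startup window is roughly `initialDelaySeconds + failureThreshold × periodSeconds`. A minimal sketch of that calculation (the specific threshold values below are assumptions chosen to reproduce the 90s default mentioned above, not confirmed Rook defaults):

```python
def startup_window(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Worst-case seconds before the kubelet gives up on a startupProbe
    and restarts the container."""
    return initial_delay_s + failure_threshold * period_s

# Assumed values that would yield the 90s window described above.
print(startup_window(initial_delay_s=0, period_s=10, failure_threshold=9))  # 90
```

By the same arithmetic, reaching a 90-minute window means raising `failureThreshold` (or `periodSeconds`) substantially, e.g. `failureThreshold=540` at a 10s period.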
In order to keep the liveness probe at a reasonable timeout, wouldn't we need to know when the OSD is actually started? The startup and liveness probes are the same, so once the OSD process is running, the liveness probe will start unless we have the readiness probe. Instead of disabling the readiness probe, I wonder if it would be better to set the liveness probe to a very large value. Would 90 minutes be sufficient? Too much? |
Or similarly, what if we keep both probes, but set the startupProbe to wait for 90 or even 120 minutes? |
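If we went that route, the override might look something like this in the CephCluster CRD (the values are illustrative, not proposed defaults, and the schema should be checked against the health-settings doc for the Rook version in use):

```yaml
# Fragment of a CephCluster spec: keep the OSD startup probe but
# extend its window to roughly 120 minutes (720 failures x 10s period).
spec:
  healthCheck:
    startupProbe:
      osd:
        probe:
          periodSeconds: 10
          failureThreshold: 720
```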
Is this a bug report or feature request?
Deviation from expected behavior:
When an OSD needs to be repaired, it should repair itself except in particularly exceptional circumstances. In my case, after some server issues that forced me to restart several servers, I had 8 different OSDs that needed repair but kept getting killed by the startupProbe timeout before they could finish.
How to reproduce it (minimal and precise):
I'm not sure how best to cause this, but it seems to happen most easily on my magnetic drives (the one I'm testing with is a 6TB RAID 0 array).
**Helpful logs**
debug 2022-05-02T17:23:15.747+0000 7ff4e46763c0 0 bluestore(/var/lib/ceph/osd/ceph-0) _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while) ...
**Recommendations**
I would recommend first that the startupThreshold be increased -- at least doubled, maybe more -- to give the OSD time for a repair. Additionally, documenting how to override those thresholds in the cluster YAML would be very helpful =]
@travisn did point me to https://rook.github.io/docs/rook/v1.9/ceph-cluster-crd.html#health-settings but maybe you could put a link to that in the template?
Thanks as always for the support from your devs, and to Travis, who has now helped me with a few issues =]