Default startupProbe isn't long enough to allow for OSD repairs #10196

Closed
taxilian opened this issue May 2, 2022 · 11 comments · Fixed by #10250

taxilian commented May 2, 2022

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

When an OSD needs to be repaired it should repair itself except in particularly exceptional circumstances; in my case, after some server issues forced me to restart several nodes, I had 8 different OSDs that needed repair but kept getting killed by the startupProbe timeout before they could finish.

How to reproduce it (minimal and precise):

I'm not sure how best to cause this, but it seems to happen most easily on my magnetic drives (the one I'm testing with is a 6TB RAID 0 array).

**Helpful logs**

```
debug 2022-05-02T17:23:15.747+0000 7ff4e46763c0 0 bluestore(/var/lib/ceph/osd/ceph-0) _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while) ...
```

**Recommendations**

I would recommend, first, that the startupThreshold be increased -- at least doubled, maybe more -- to give the OSD time for a repair. Additionally, documenting how to override those thresholds in the cluster YAML would be very good as well =]

@travisn did point me to https://rook.github.io/docs/rook/v1.9/ceph-cluster-crd.html#health-settings but maybe you could put a link to that in the template?
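
For illustration, such an override in the CephCluster CR might look roughly like the sketch below. This follows the shape of the linked health-settings docs, but the exact keys (and whether a startup-probe override is exposed in a given Rook release) should be verified; every value here is an assumption:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  healthCheck:
    # Hypothetical override: stretch the OSD startup budget so a
    # post-crash allocator repair can finish before the pod is killed.
    startupProbe:
      osd:
        disabled: false
        probe:
          periodSeconds: 10
          failureThreshold: 60   # ~10 minutes instead of the default ~90 seconds
    livenessProbe:
      osd:
        disabled: false
```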

Thanks as always for the support from your devs, and to Travis, who has now helped me with a few issues =]

taxilian added the bug label May 2, 2022
spdfnet commented May 5, 2022

...nor is it long enough for compactions, deep scrubs, slow drives, etc.

I recommend disabling all probes, since they become useless over time: how often does an OSD really hang but not crash?

travisn commented May 5, 2022

Given that Rook can't predict when an OSD will need to run those long operations at startup (so it can't disable the probes automatically), and it's painful to disable the probes manually when this issue is hit, it's a good question whether probes should just be disabled by default for OSDs.
@leseb @satoru-takeuchi @BlaineEXE What if we disable the startup and liveness probes by default for OSDs?

satoru-takeuchi commented
> @travisn: What if we disable the startup and liveness probes by default for OSDs?

It looks reasonable.

leseb commented May 6, 2022

The startup probe is probably fine to disable; however, I'd keep the liveness probe in case, for some reason, the OSD is still running or hangs but does not respond properly.
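
In CR terms that split (no startup probe, liveness probe kept) might look roughly like this hedged sketch, assuming the healthCheck section exposes per-daemon disabled flags as in the docs linked above:

```yaml
spec:
  healthCheck:
    startupProbe:
      osd:
        disabled: true    # no startup deadline; repairs can run as long as they need
    livenessProbe:
      osd:
        disabled: false   # still restart an OSD that hangs after it has started
```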

travisn commented May 6, 2022

The question is really how long it's reasonable to allow an OSD to start up before it's considered stuck and in need of a restart. If we keep the liveness probe, shall we set the initial delay to 2 hours, or longer? How long might repair operations take before the OSD can respond to the liveness probe? @neha-ojha?
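
Expressed as probe fields, a two-hour initial delay would look approximately like this (illustrative values only):

```yaml
livenessProbe:
  initialDelaySeconds: 7200   # 2 hours of grace before the first liveness check
  periodSeconds: 10
```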

neha-ojha commented

> The question is really how long it's reasonable to allow an OSD to start up before it's considered stuck and in need of a restart. If we keep the liveness probe, shall we set the initial delay to 2 hours, or longer? How long might repair operations take before the OSD can respond to the liveness probe? @neha-ojha?

The issue described in #10196 (comment) is related to a new feature in Quincy (ceph/ceph#39871). The OSD now rebuilds allocation information from the store at startup after an ungraceful shutdown, in order to achieve better write performance during normal operation. This will add startup time, but I am curious how much time it took in this case and what the current value of startupThreshold is.
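
For anyone checking their own cluster, the probe currently applied to an OSD pod can be read with `kubectl -n rook-ceph get deployment rook-ceph-osd-0 -o jsonpath='{.spec.template.spec.containers[0].startupProbe}'` (assuming Rook's usual namespace and OSD deployment naming).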

taxilian commented May 8, 2022

Not claiming that this is statistically significant, but after disabling my liveness and startup checks I did have one OSD go weird on me today, where it was down but didn't crash... so it at least can happen =]

Granted, my cluster is in a very foobarred state right now, so there could be many factors at play.

spdfnet commented May 9, 2022

> Not claiming that this is statistically significant, but after disabling my liveness and startup checks I did have one OSD go weird on me today, where it was down but didn't crash... so it at least can happen =]
>
> Granted, my cluster is in a very foobarred state right now, so there could be many factors at play.

I don't think restarting it every 5 minutes would have made it any better. IMO this kind of thing needs manual intervention.

travisn commented May 9, 2022

> > The question is really how long it's reasonable to allow an OSD to start up before it's considered stuck and in need of a restart. If we keep the liveness probe, shall we set the initial delay to 2 hours, or longer? How long might repair operations take before the OSD can respond to the liveness probe? @neha-ojha?
>
> The issue described in #10196 (comment) is related to a new feature in Quincy (ceph/ceph#39871). The OSD now rebuilds allocation information from the store at startup after an ungraceful shutdown, in order to achieve better write performance during normal operation. This will add startup time, but I am curious how much time it took in this case and what the current value of startupThreshold is.

The default startup probe currently gives the OSD only 90s to start and respond on its admin port before it assumes the OSD needs to be restarted.
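
As a rough decomposition of that 90s budget (illustrative numbers, not necessarily Rook's literal defaults), a startup probe's total allowance is initialDelaySeconds + failureThreshold * periodSeconds:

```yaml
startupProbe:
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 8    # 10 + 8 * 10 = 90 seconds before the kubelet restarts the container
```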

BlaineEXE commented May 11, 2022

> The startup probe is probably fine to disable; however, I'd keep the liveness probe in case, for some reason, the OSD is still running or hangs but does not respond properly.

In order to keep the liveness probe at a reasonable timeout, wouldn't we need to know when the OSD has actually started? The startup and liveness probes are the same, so once the OSD process is running, the liveness probe will start unless we have the startup probe. Instead of disabling the startup probe, I wonder if it would be better to set the liveness probe to a very large value. Would 90 minutes be sufficient? Too much?

travisn commented May 11, 2022

> > The startup probe is probably fine to disable; however, I'd keep the liveness probe in case, for some reason, the OSD is still running or hangs but does not respond properly.
>
> In order to keep the liveness probe at a reasonable timeout, wouldn't we need to know when the OSD has actually started? The startup and liveness probes are the same, so once the OSD process is running, the liveness probe will start unless we have the startup probe. Instead of disabling the startup probe, I wonder if it would be better to set the liveness probe to a very large value. Would 90 minutes be sufficient? Too much?

Or similarly, what if we keep both probes, but set the startupProbe to wait for 90 or even 120 minutes?
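
For scale, a two-hour startup budget expressed in probe fields would be roughly (again illustrative):

```yaml
startupProbe:
  periodSeconds: 60
  failureThreshold: 120   # 120 * 60 s = 7200 s = 2 hours before giving up
```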
