RFE: options for OSD restart gate #13811

anthonyeleven · 2024-02-22T16:03:22Z

My understanding is that the operator gates OSD restarts via ceph osd ok-to-stop.

I would like the option to change or augment this condition so that the operator will not proceed to restart the next OSD unless all PGs are active+clean.
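For illustration, the stricter condition could be evaluated from the JSON that `ceph status --format json` reports under `pgmap`. This is only a minimal sketch, not the operator's code: the function names are hypothetical, and treating only the exact state `active+clean` as clean is an assumption (a real gate might also accept scrubbing variants).

```python
import json
import subprocess

# States counted as "clean". Accepting only the exact state is an assumption;
# pass a larger set to also allow e.g. "active+clean+scrubbing".
CLEAN_STATES = frozenset({"active+clean"})


def all_pgs_clean(status: dict, clean_states=CLEAN_STATES) -> bool:
    """Return True only if every PG in the status report is in a clean state."""
    pgmap = status.get("pgmap", {})
    total = pgmap.get("num_pgs", 0)
    clean = sum(
        entry.get("count", 0)
        for entry in pgmap.get("pgs_by_state", [])
        if entry.get("state_name") in clean_states
    )
    # A cluster with no PGs has nothing to wait for, so 0 == 0 passes.
    return clean == total


def fetch_status() -> dict:
    """Run `ceph status --format json`; needs a working ceph CLI and keyring."""
    out = subprocess.run(
        ["ceph", "status", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

A gate built on this would call `all_pgs_clean(fetch_status())` before each restart and wait while it returns False.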

We recently adjusted our K8s requests and limits settings and experienced a subset of inactive PGs that impacted clients on an EC HDD pool. My confidence in ok-to-stop has always been limited, but I don't have a smoking gun.

Is this a bug report or feature request?

Feature Request

What should the feature do:

What is use case behind this feature:

Environment:


travisn · 2024-02-22T17:58:52Z

Ok, this would be an even stricter option for upgrades, such as upgradeChecksRequireActiveCleanPGs.
So you don't have evidence that ok-to-stop wasn't working, but this extra level of assurance would give you more confidence in the upgrades?

anthonyeleven · 2024-02-22T18:07:46Z

Exactly. This wasn't even an upgrade, it was a K8s-induced rolling restart. Since ok-to-stop as a sole gate would seem to permit OSD restarts while backfill is pending, maybe there's a race condition or something. I don't have the ability to attempt to reproduce, as the production cluster is the only one with any scale and EC HDDs.

Alternately I'd love to even be able to introduce a configurable extra delay between OSD restarts, if that would be simpler. I.e., "sleep for 300 seconds after you get ok-to-stop" before proceeding.
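The delay idea can be sketched as a rolling-restart loop that pauses after each successful ok-to-stop answer. This is a hedged illustration only: `rolling_restart`, `ok_to_stop`, `restart`, and the parameter names are hypothetical stand-ins, not Rook or Ceph APIs, and the 300-second default simply mirrors the example above.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    osd_ids: Iterable[int],
    ok_to_stop: Callable[[int], bool],   # hypothetical wrapper around the ok-to-stop check
    restart: Callable[[int], None],      # hypothetical restart action
    settle_seconds: float = 300.0,       # extra delay after a positive answer
    poll_seconds: float = 10.0,          # wait between repeated checks
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Restart OSDs one at a time, pausing after each ok-to-stop."""
    restarted = []
    for osd_id in osd_ids:
        # Keep polling until the gate reports this OSD may be stopped.
        while not ok_to_stop(osd_id):
            sleep(poll_seconds)
        # Configurable extra delay before actually proceeding.
        sleep(settle_seconds)
        restart(osd_id)
        restarted.append(osd_id)
    return restarted
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting.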

mmaoyu · 2024-03-29T07:25:52Z

I think this feature requires checking whether the cluster's PGs are clean before upgrading any OSDs.
The implementation steps are:

Add a configuration option UpgradeOSDRequireHealthyPGs to the CephCluster CRD spec.
In the updateExistingOSDs method, before popping the next OSD from the queue, check whether the cluster status is clean when SkipUpgradeChecks=false and UpgradeOSDRequireHealthyPGs=true. If the status is not clean or the check fails, return; otherwise continue with the upgrade logic.
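The decision described in the second step can be sketched as a small predicate. This is an illustration of the proposed logic in Python, not the operator's Go code: `may_update_osds` and `pgs_are_clean` are hypothetical names, and treating a failed check the same as an unclean cluster follows the wording of the step above.

```python
from typing import Callable


def may_update_osds(
    skip_upgrade_checks: bool,
    require_healthy_pgs: bool,
    pgs_are_clean: Callable[[], bool],  # hypothetical cluster check
) -> bool:
    """Decide whether the OSD update loop may proceed on this pass."""
    # The extra gate applies only when checks are enabled and the option is set.
    if skip_upgrade_checks or not require_healthy_pgs:
        return True
    try:
        return pgs_are_clean()
    except Exception:
        # A failed check is handled like an unclean cluster: do not proceed.
        return False
```

When the predicate returns False, the caller would return without touching any OSD and try again on a later pass.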

I'd like to take this issue 😄

mmaoyu · 2024-03-29T07:26:33Z

/assign

github-actions · 2024-03-29T07:26:43Z

Thanks for taking this issue! Let us know if you have any questions!

anthonyeleven · 2024-04-18T17:37:38Z

Thanks for the feature! It occurs to me that I might have asked for a healthy mon quorum too but I think that's a lower risk.

anthonyeleven added the feature label Feb 22, 2024

travisn added the good-first-issue label ("Simple issues that are good for getting started with Rook.") Feb 22, 2024

travisn added this to To do in v1.14 via automation Feb 22, 2024

github-actions bot assigned mmaoyu Mar 29, 2024

mmaoyu mentioned this issue Apr 8, 2024

osd: add option upgradeOSDRequiresHealthyPGs #14040

Merged

6 tasks

BlaineEXE closed this as completed in #14040 Apr 11, 2024

v1.14 automation moved this from To do to Done Apr 11, 2024

mergify bot mentioned this issue Apr 11, 2024

osd: add option upgradeOSDRequiresHealthyPGs (backport #14040) #14061

Merged

6 tasks
