RFE: options for OSD restart gate #13811

Closed
anthonyeleven opened this issue Feb 22, 2024 · 6 comments · Fixed by #14040
Assignees
Labels
feature good-first-issue Simple issues that are good for getting started with Rook.
Projects

Comments

@anthonyeleven
Contributor

My understanding is that the operator gates OSD restarts via ceph osd ok-to-stop.

I would like the option to change or augment this condition so that the operator will not proceed to restart the next OSD unless all PGs are active+clean.

We recently adjusted our K8s requests and limits settings and experienced a subset of inactive PGs that impacted clients on an EC HDD pool. My confidence in ok-to-stop has always been limited, but I don't have a smoking gun.
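
As a rough illustration of the stricter gate requested here, the following Python sketch decides whether every PG is clean, given the JSON text printed by `ceph status --format json`. It assumes the `pgmap.num_pgs` and `pgmap.pgs_by_state` fields of that output. The function name, the set of states counted as clean, and the fail-safe handling of an empty PG map are choices made for this example, not Rook's implementation.

```python
import json

# States treated as "clean" in this sketch. Counting the scrubbing variants
# as clean is an assumption; a stricter gate could accept only "active+clean".
CLEAN_STATES = {
    "active+clean",
    "active+clean+scrubbing",
    "active+clean+scrubbing+deep",
}


def all_pgs_active_clean(status_json: str) -> bool:
    """Return True only if every PG reported in the status output is clean."""
    pgmap = json.loads(status_json).get("pgmap", {})
    total = pgmap.get("num_pgs", 0)
    clean = sum(
        entry.get("count", 0)
        for entry in pgmap.get("pgs_by_state", [])
        if entry.get("state_name") in CLEAN_STATES
    )
    # A missing or empty PG map is treated as "not clean" so the gate fails safe.
    return total > 0 and clean == total
```

A caller would feed this the captured command output and hold the next OSD restart while it returns False.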

Is this a bug report or feature request?

  • Feature Request

@travisn
Member

travisn commented Feb 22, 2024

Ok, this would be an even stricter option for upgrades, such as upgradeChecksRequireActiveCleanPGs.
So you don't have evidence that ok-to-stop wasn't working, but it would help you gain more confidence in the upgrades with this extra level of assurance?

@travisn travisn added the good-first-issue Simple issues that are good for getting started with Rook. label Feb 22, 2024
@travisn travisn added this to To do in v1.14 via automation Feb 22, 2024
@anthonyeleven
Contributor Author

anthonyeleven commented Feb 22, 2024

Exactly. This wasn't even an upgrade, it was a K8s-induced rolling restart. Since ok-to-stop as a sole gate would seem to permit OSD restarts while backfill is pending, maybe there's a race condition or something. I don't have the ability to attempt to reproduce, as the production cluster is the only one with any scale and EC HDDs.

Alternately I'd love to even be able to introduce a configurable extra delay between OSD restarts, if that would be simpler. I.e., "sleep for 300 seconds after you get ok-to-stop" before proceeding.
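
The delay idea can be sketched as a small rolling-restart loop. Everything here is hypothetical: `rolling_restart`, its `ok_to_stop` and `restart` callbacks, and the `settle_seconds` parameter are names invented for this example and do not correspond to an existing Rook setting. The sleep function is injectable so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    osd_ids: Iterable[int],
    ok_to_stop: Callable[[int], bool],
    restart: Callable[[int], None],
    settle_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Restart OSDs one at a time, pausing after each ok-to-stop approval.

    Returns the IDs that were restarted. The loop stops at the first OSD
    that is not reported as safe to stop.
    """
    restarted: List[int] = []
    for osd_id in osd_ids:
        if not ok_to_stop(osd_id):
            break
        # Extra settle time between the ok-to-stop answer and the restart.
        sleep(settle_seconds)
        restart(osd_id)
        restarted.append(osd_id)
    return restarted
```

A fixed delay is simpler than a PG-state check but only reduces the window for a race; it does not confirm that backfill has finished.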

@mmaoyu
Member

mmaoyu commented Mar 29, 2024

I think this feature requires checking whether the cluster's PGs are clean before upgrading any OSDs.
Implementation steps are:

  1. Add a configuration option, UpgradeOSDRequireHealthyPGs, to the CephCluster CRD spec.
  2. In the method updateExistingOSDs, before popping the OSD query, check whether the Ceph cluster status is clean when SkipUpgradeChecks=false and UpgradeOSDRequireHealthyPGs=true. If the status is not clean or the check fails, return; otherwise continue with the upgrade logic.
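
The decision in step 2 can be sketched as a small predicate. This is an illustration in Python rather than Rook's Go code; `may_update_osds` and the `pgs_are_clean` callback are hypothetical names, and only the two flags come from the steps above.

```python
from typing import Callable


def may_update_osds(
    skip_upgrade_checks: bool,
    require_healthy_pgs: bool,
    pgs_are_clean: Callable[[], bool],
) -> bool:
    """Decide whether the OSD update loop may take the next OSD."""
    # The PG gate applies only when upgrade checks are enabled
    # and the new option is switched on.
    if skip_upgrade_checks or not require_healthy_pgs:
        return True
    try:
        return pgs_are_clean()
    except Exception:
        # A failed check is treated like an unclean cluster: do not proceed.
        return False
```

Treating a failed check the same as an unclean result keeps the gate conservative, which matches the "return" branch described in step 2.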

I'd like to take the issue 😄

@mmaoyu
Member

mmaoyu commented Mar 29, 2024

/assign

Thanks for taking this issue! Let us know if you have any questions!

@anthonyeleven
Contributor Author

Thanks for the feature! It occurs to me that I might have asked for a healthy mon quorum too but I think that's a lower risk.

Projects
v1.14
Done
3 participants