core: set blocking PDB even if no unhealthy PGs appear #13511
Merged
travisn merged 1 commit into rook:master from ushitora-anqou:set-pdb-even-if-pgs-remain-active-clean on Feb 6, 2024
If the PGs are clean, the intent is that the PDBs will again reset after some timeout. Perhaps 30s was just too short. What about a timeout of 5 or 10 minutes? @sp98 thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case osdDown && pgClean:
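The `case osdDown && pgClean:` branch is the heart of the change: with this PR, a down OSD triggers blocking PDBs even when every PG is active+clean. A minimal sketch of that decision, with hypothetical names (`decidePDB`, `osdDown`, `pgClean`) that only illustrate the idea and are not Rook's actual implementation:

```go
package main

import "fmt"

// decidePDB returns true when blocking PodDisruptionBudgets should be
// kept in place. Under this PR's behavior, any down OSD is enough to
// keep the blocking PDBs, even if all PGs are active+clean; previously
// the clean-PG case would reset the PDBs after a 30-second timeout.
func decidePDB(osdDown, pgClean bool) bool {
	switch {
	case osdDown && pgClean:
		// OSD down but data is healthy: still block, so the node
		// that is draining can finish before others are allowed to.
		return true
	case osdDown && !pgClean:
		// OSD down and PGs degraded: definitely block.
		return true
	default:
		// All OSDs up: no blocking PDBs needed.
		return false
	}
}

func main() {
	fmt.Println(decidePDB(true, true))  // OSD down, PGs active+clean
	fmt.Println(decidePDB(false, true)) // healthy cluster
}
```

The sketch makes the behavior change discussed below concrete: the first case used to fall through to a timeout-based reset, and now it unconditionally keeps the blocking PDBs.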
So the whole idea of the 30-second time period is to give Rook enough time to read the PG status correctly.
If the OSD is down but the PGs are active+clean, why would we even want to enable the blocking PDBs? Shouldn't we allow the next OSD to drain, since the data is safe? @ushitora-anqou
@sp98 If there are multiple OSDs on the node, the only way for a node to be drained is if the blocking PDBs are applied to other nodes or zones to allow this one to drain. If I'm understanding the timeout correctly, perhaps we need a longer timeout if there are multiple OSDs on a node, so the local node/zone can be allowed to drain.
@ushitora-anqou Thanks for the details. Can you please confirm the pdb behavior for the following scenarios as well?
@sp98 The blocking PDBs were successfully created in both scenarios. The details are: https://github.com/ushitora-anqou/rook-on-minikube/tree/4332b90d438c2a919bef3e9bef8f4041795e913d?tab=readme-ov-file#additional-investigation
Thanks for the detailed testing @ushitora-anqou . Really appreciate the effort you have put into testing this.
@travisn The PR looks good to me, so I'm approving it. I don't think there will be any regression due to this change (fingers crossed), but I'm still requesting your approval before we merge, just to ensure I'm not missing anything.
I understand the changes in this PR, but I want to be clear about the behavior change. If any OSD pod is down, now we would expect the blocking PDBs to be in place. Even if the PGs become active+clean, the blocking PDBs will always be enabled. This means that if an OSD is down due to a failed disk, that OSD would either have to be repaired or purged before the PDBs will be reset again.
This behavior is simple and intuitive to me, but it is different from the previous behavior that would allow the PDBs to be reset after some time even if an OSD disk fails and Ceph backfills its data to other OSDs.
Is that correct, or is any other clarification needed?
It seems correct to me.
Yes, it is. My PR argues that the previous behavior is problematic because the reset PDBs don't allow the OSDs that could safely be drained to actually be drained.