mds: improve the mds liveness probe calls #12860
Let's finalize the discussion in #12789 before reviewing...
@batrick, can you tell me how to intentionally remove the MDS from the fs map? I tried this,
but it did not remove the MDS from the fs map.
Also, please check the golangci-lint errors and rebase to get a greener CI. Thanks.
This looks like good work, but I am sorry to say that I am not yet convinced of the approach taken here. I think it might be too complicated:
According to #12789, the authoritative notion of MDS health lives in the mon(s), so as I understand it, implementing our own elaborate check somewhat violates that principle. Therefore I think we should ideally identify an extremely simple call to the mon (much simpler than fetching the fs map, which then has to be parsed). The result of that call would tell us whether this MDS is deemed healthy. If it is, pass the probe; otherwise, fail it.
I am still waiting for confirmation of my thoughts by @batrick in the issue, but already requesting a simplification of the mechanism for this PR ...
@obnoxxx The discussed approach for querying the mons is with the |
Please also see: #12789 (comment)
The MDS will restart and come back almost instantly when removed. So maybe that's why? Did the "gid" not change for the named MDS?
Instead of checking the socket files for mds daemon, check the mds daemon in fs maps
Closes: rook#12789
Signed-off-by: parth-gr <paarora@redhat.com>
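To make the fs map check concrete, here is a minimal Go sketch of looking up a named MDS in a JSON fs dump and returning its gid, which is also the value one would compare before and after a removal attempt. It is not the PR's implementation, which appears to rely on jq. The JSON layout (`filesystems[].mdsmap.info` and `standbys`), the daemon names, and the helper `findMDS` are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mdsInfo holds the only fields of an MDS entry that this sketch reads.
type mdsInfo struct {
	GID  uint64 `json:"gid"`
	Name string `json:"name"`
}

// fsDump models only the parts of the dump used here (assumed layout).
type fsDump struct {
	Standbys    []mdsInfo `json:"standbys"`
	Filesystems []struct {
		MDSMap struct {
			Info map[string]mdsInfo `json:"info"`
		} `json:"mdsmap"`
	} `json:"filesystems"`
}

// findMDS (hypothetical helper) reports the gid of the named daemon and
// whether it appears in the dump, either holding a rank or as a standby.
func findMDS(dump []byte, name string) (uint64, bool, error) {
	var d fsDump
	if err := json.Unmarshal(dump, &d); err != nil {
		return 0, false, fmt.Errorf("parsing fs dump: %w", err)
	}
	for _, fs := range d.Filesystems {
		for _, info := range fs.MDSMap.Info {
			if info.Name == name {
				return info.GID, true, nil
			}
		}
	}
	for _, s := range d.Standbys {
		if s.Name == name {
			return s.GID, true, nil
		}
	}
	return 0, false, nil
}

// sampleDump is made-up data in the assumed layout, not real cluster output.
const sampleDump = `{
  "standbys": [{"gid": 5301, "name": "myfs-b"}],
  "filesystems": [
    {"mdsmap": {"info": {"gid_4242": {"gid": 4242, "name": "myfs-a"}}}}
  ]
}`

func main() {
	gid, found, err := findMDS([]byte(sampleDump), "myfs-a")
	if err != nil {
		fmt.Println("probe error:", err)
		return
	}
	// A liveness probe would pass when found is true and fail otherwise.
	fmt.Println(found, gid)
}
```

A probe built this way would fail when the daemon is missing from the map. Comparing the returned gid across two dumps would show whether the daemon was removed and then re-registered.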
I want to approve this PR, but one change is needed in the unit test workflow: a pipe (|) in the jq install config looks like a possible source of errors.
Testing with chaos-mesh: a network attack removes the MDS from the fs dump, and the pod gets restarted.
feedback addressed
A few small suggestions
pkg/operator/ceph/file/mds/mdsLivenessProbeTestsFiles/0FS-2MDS.json
- run: GOPATH=$(go env GOPATH) make -j $(nproc) test
+ run: |
+   export ROOK_UNIT_JQ_PATH="$(which jq)"
+   GOPATH=$(go env GOPATH) make -j $(nproc) test | tee output.txt
If the unit tests fail, this step will fail, right? Just wanted to double-check that piping to tee doesn't mess up the failure code.
It worked.
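For context on the question about tee: in bash, a pipeline's exit status is that of its last command unless the `pipefail` option is set, so whether a failing `make ... | tee output.txt` fails the step depends on how the shell is invoked. The sketch below shows the difference by shelling out to bash from Go. It assumes bash is on the PATH, uses `false` as a stand-in for a failing test run, and `exitCode` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitCode (hypothetical helper) runs script with bash and returns its exit
// status, or -1 if bash itself could not be started.
func exitCode(script string) int {
	err := exec.Command("bash", "-c", script).Run()
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1
}

func main() {
	// "false" stands in for a failing test run; tee succeeds either way.
	fmt.Println("without pipefail:", exitCode("false | tee /dev/null"))
	fmt.Println("with pipefail:", exitCode("set -o pipefail; false | tee /dev/null"))
}
```

Without `pipefail`, the failure of the first command is hidden behind tee's success. With it, the pipeline reports the failing status.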
Signed-off-by: parth-gr <paarora@redhat.com>
mds: improve the mds liveness probe calls (backport #12860)
Description of your changes:
Instead of checking the socket files for mds daemon, check the mds daemon in fs map
Which issue is resolved by this Pull Request:
Closes: #12789