Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mds: improve the mds liveness probe calls #12860

Merged
merged 2 commits into from
Oct 10, 2023
Merged

Conversation

parth-gr
Copy link
Member

@parth-gr parth-gr commented Sep 6, 2023

Description of your changes:

Instead of checking the socket files for mds daemon, check the mds daemon in fs map

Which issue is resolved by this Pull Request:
Resolves # Closes: #12789

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Copy link
Member

@travisn travisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's finalize the discussion in #12789 before reviewing...

@parth-gr
Copy link
Member Author

@batrick can you tell me the way to intentionally remove the mds from fs map,
So I can do the false testing of this feature

I tried this

sh-4.4$ ceph mds fail myfs-b
failed mds gid 41539

But it didn't succeed to remove it from fsmap

Copy link
Contributor

@subhamkrai subhamkrai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, please check the Golangci linter errors and rebase to get more gree CI. Thanks

pkg/operator/ceph/controller/spec.go Outdated Show resolved Hide resolved
pkg/operator/ceph/controller/spec.go Outdated Show resolved Hide resolved
Copy link
Contributor

@obnoxxx obnoxxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@parth-gr,

This looks like good work, but I am sorry to say that I am not yet convinced of the approach taken here. I think it might be too complicated:

According to #12789 , the authoritative notion of health of the mds is in the mon(s), so per my understanding, implementing our own elaborate check for this somewhat violates that principle, Therefore I think we should ideally identify an extremely simple call to the mon (much more simple than fetching the fs map which in turn needs to be parsed). The result of this simple call would be the information whether this mds is deemed healthy. if that is the case, pass the probe, otherwise fail it.

I am still waiting for confirmation of my thoughts by @batrick in the issue, but already requesting a simplification of the mechanism for this PR ...

@travisn
Copy link
Member

travisn commented Sep 15, 2023

@obnoxxx The discussed approach for querying the mons is with the ceph fs dump command, which the liveness probe is executing and then parsing the results, so it looks expected to me.

@batrick
Copy link
Contributor

batrick commented Sep 15, 2023

Please also see: #12789 (comment)

@batrick
Copy link
Contributor

batrick commented Sep 15, 2023

@batrick can you tell me the way to intentionally remove the mds from fs map, So I can do the false testing of this feature

I tried this

sh-4.4$ ceph mds fail myfs-b
failed mds gid 41539

But it didn't succeed to remove it from fsmap

The MDS will restart and come back almost instantly when removed. So maybe that's why? Did the "gid" not change for the named MDS?

Instead of checking the socket files for mds daemon,
check the mds daemon in fs maps

Closes: rook#12789

Signed-off-by: parth-gr <paarora@redhat.com>
@parth-gr parth-gr force-pushed the mds-liveness branch 9 times, most recently from 52755a3 to 5d11029 Compare October 4, 2023 20:45
Copy link
Member

@BlaineEXE BlaineEXE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to approve this PR, but there is one change needed in the unit test workflow regarding a pipe (|) in the jq install config that seems like a risk for being an error.

.github/workflows/unit-test.yml Outdated Show resolved Hide resolved
.github/workflows/unit-test.yml Outdated Show resolved Hide resolved
@parth-gr
Copy link
Member Author

parth-gr commented Oct 5, 2023

Testing with chaos-mesh

Doing a network attack to remove mds from fs dump ,
Liveness probe:

  Normal   Scheduled         28m                 default-scheduler  Successfully assigned rook-ceph/rook-ceph-mds-myfs-a-a-7cc44bf9bd-5cgkt to minikube
  Normal   Pulled            28m                 kubelet            Container image "quay.io/ceph/ceph:v17.2.6" already present on machine
  Normal   Created           28m                 kubelet            Created container chown-container-data-dir
  Normal   Started           28m                 kubelet            Started container chown-container-data-dir
  Normal   Created           28m                 kubelet            Created container log-collector
  Normal   Pulled            28m                 kubelet            Container image "quay.io/ceph/ceph:v17.2.6" already present on machine
  Normal   Started           28m                 kubelet            Started container log-collector
  Normal   Killing           93s                 kubelet            Container mds failed liveness probe, will be restarted
  Normal   Created           92s (x2 over 28m)   kubelet            Created container mds
  Normal   Started           92s (x2 over 28m)   kubelet            Started container mds
  Normal   Pulled            92s (x2 over 28m)   kubelet            Container image "quay.io/ceph/ceph:v17.2.6" already present on machine
  Warning  Unhealthy         3s (x7 over 3m33s)  kubelet            Liveness probe failed:
[rider@localhost examples]$ 

Pod get restarted

rook-ceph-mds-myfs-a-a-7cc44bf9bd-5cgkt                  2/2     Running     1 (25s ago)     33m

@travisn travisn dismissed stale reviews from obnoxxx and subhamkrai October 9, 2023 23:33

feedback addressed

Copy link
Member

@travisn travisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small suggestions

.github/workflows/unit-test.yml Outdated Show resolved Hide resolved
.github/workflows/unit-test.yml Show resolved Hide resolved
run: GOPATH=$(go env GOPATH) make -j $(nproc) test
run: |
export ROOK_UNIT_JQ_PATH="$(which jq)"
GOPATH=$(go env GOPATH) make -j $(nproc) test | tee output.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the unit tests fail, this step will fail, right? Just wanted to double check that piping to tee doesn't mess up the failure code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It woked

.github/workflows/unit-test.yml Outdated Show resolved Hide resolved
Signed-off-by: parth-gr <paarora@redhat.com>
@parth-gr parth-gr force-pushed the mds-liveness branch 2 times, most recently from f935b34 to 3d40365 Compare October 10, 2023 13:34
@travisn travisn merged commit 703f594 into rook:master Oct 10, 2023
50 checks passed
travisn added a commit that referenced this pull request Oct 10, 2023
mds: improve the mds liveness probe calls (backport #12860)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use monitors to determine MDS liveness
6 participants