
core: increase liveness probe timeout to 2s #10460

Merged
merged 1 commit into from Jun 17, 2022

Conversation

subhamkrai
Contributor

We have noticed multiple failures caused by the liveness probe
failing, most of the time due to limited resources. Increasing
the timeout fixes this, so increase the probe timeout from the
default of 1s to 2s to give the probe more time to complete
before failing.

Signed-off-by: subhamkrai <srai@redhat.com>
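For context, the shape of the change can be sketched as below. `Probe` here is a minimal stand-in for Kubernetes' `corev1.Probe`, and `applyDefaultTimeout` is a hypothetical helper used only for illustration; the actual change lives in pkg/operator/ceph/controller/spec.go.

```go
package main

import "fmt"

// Probe is a minimal stand-in for corev1.Probe; only the field
// touched by this PR is modeled.
type Probe struct {
	TimeoutSeconds int32
}

// applyDefaultTimeout sets the probe timeout to 2s instead of the
// Kubernetes default of 1s, but leaves an explicit user-supplied
// value alone. (Hypothetical helper for illustration.)
func applyDefaultTimeout(p *Probe) {
	if p.TimeoutSeconds == 0 {
		p.TimeoutSeconds = 2
	}
}

func main() {
	defaulted := &Probe{}                   // no explicit timeout set
	overridden := &Probe{TimeoutSeconds: 5} // user override

	applyDefaultTimeout(defaulted)
	applyDefaultTimeout(overridden)

	fmt.Println(defaulted.TimeoutSeconds, overridden.TimeoutSeconds) // 2 5
}
```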

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Member

@travisn travisn left a comment


If the user overrides the liveness probe settings, this timeout can still be overridden, right?

pkg/operator/ceph/controller/spec.go (review comment, outdated and resolved)
@subhamkrai
Contributor Author

> If the user overrides the liveness probe settings, this timeout can still be overridden, right?

I'll test and update but I think we should be able to override.

@subhamkrai
Contributor Author

> If the user overrides the liveness probe settings, this timeout can still be overridden, right?
>
> I'll test and update but I think we should be able to override.

I changed the mon deployment to a timeout of 5s and it was applied, but when I restarted the operator the timeout reverted to 2s, and the respective Ceph pod (in my case the mon pod) restarted with a 2s timeout.

@subhamkrai
Contributor Author

With the above observation I think this will be problematic: if the operator restarts for any reason, it will bring the timeout back to 2s, the respective Ceph pod will restart, and it could end up in a CrashLoopBackOff (CLBO) state due to liveness probe failures. Thoughts @travisn?

@subhamkrai subhamkrai requested a review from travisn June 17, 2022 13:51
@travisn
Member

travisn commented Jun 17, 2022

> If the user overrides the liveness probe settings, this timeout can still be overridden, right?
>
> I'll test and update but I think we should be able to override.
>
> I changed the mon deployment to a timeout of 5s and it was applied, but when I restarted the operator the timeout reverted to 2s, and the respective Ceph pod (in my case the mon pod) restarted with a 2s timeout.

This certainly sounds unexpected. You're sure all you did was restart the operator? How about trying a few variations to narrow down the repro:

  1. Create a new cluster with the default timeout, then update the cluster CR with a new timeout
  2. Restart the operator
  3. Create a new cluster with the modified timeout
  4. Restart the operator

At each step, see if the timeout is expected. It sounds like you already ran the first two steps. How about the other steps?

@subhamkrai
Contributor Author

> If the user overrides the liveness probe settings, this timeout can still be overridden, right?
>
> I'll test and update but I think we should be able to override.
>
> I changed the mon deployment to a timeout of 5s and it was applied, but when I restarted the operator the timeout reverted to 2s, and the respective Ceph pod (in my case the mon pod) restarted with a 2s timeout.

> This certainly sounds unexpected. You're sure all you did was restart the operator?

I created a cluster with a 2s timeout, modified the mon deployment to a 5s timeout, restarted the operator pod, and the timeout went back to 2s.

> How about trying a few variations to narrow down the repro:
>
>   1. Create a new cluster with the default timeout, then update the cluster CR with a new timeout

Do you mean the default timeout of 2s (not the k8s default of 1s)?

>   2. Restart the operator
>   3. Create a new cluster with the modified timeout
>   4. Restart the operator
>
> At each step, see if the timeout is expected. It sounds like you already ran the first two steps. How about the other steps?

Question: after step 2, does "create a new cluster" mean deleting the cluster.yaml file and creating it again?

@travisn
Member

travisn commented Jun 17, 2022

> Question: after step 2, does "create a new cluster" mean deleting the cluster.yaml file and creating it again?

Per offline discussion, it sounds like the confusion was about updating the cluster CR versus editing the deployment manually. The behavior was expected, since the deployment was modified manually and the operator reconciled it back.
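For reference, the supported way to override the probe is through the CephCluster CR rather than editing the deployment directly. A sketch of such an override, assuming the `healthCheck.livenessProbe` per-daemon override exposed by Rook's CephCluster CRD (values here are illustrative):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  healthCheck:
    livenessProbe:
      mon:
        disabled: false
        probe:
          timeoutSeconds: 5   # overrides the operator's 2s default
```

A value set this way survives operator restarts, because the operator reconciles deployments from the CR rather than from manual edits.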

Member

@BlaineEXE BlaineEXE left a comment


Seems good to me after the last discussion, right?

@travisn
Member

travisn commented Jun 17, 2022

> Seems good to me after the last discussion, right?

Agreed, will go ahead with it.

@travisn travisn merged commit 5806c21 into rook:master Jun 17, 2022
48 checks passed
mergify bot added a commit that referenced this pull request Jun 19, 2022
core: increase liveness probe timeout to 2s (backport #10460)
@subhamkrai subhamkrai deleted the increase-probe-timeout branch June 20, 2022 03:14