ceph: increase gateway liveness timeout #4482

Closed

@ashangit ashangit commented Dec 12, 2019

Description of your changes:
Increase the default timeout of the gateway liveness check from 1s to 10s.

Which issue is resolved by this Pull Request:
Resolves the issue described below:
While running a load test (a distcp from Hadoop to S3 of 45 TB, with 200 mappers running in parallel) on our Ceph/S3 platform based on Rook 1.1.7, I observed many restarts of the gateway components.
A first cause was a bug in Ceph v14.2.4, fixed in v14.2.5 (ceph/ceph#29559).
But most of the restarts are due to the liveness probe timeout, which uses the default setting of 1s.
Under heavy load, the health checks take more than a second to be processed, which leads to the gateway being restarted even though it still responds to queries.
Since I increased the timeout to 10s, this issue no longer happens and the service is stable.
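
To illustrate the failure mode, here is a minimal Python sketch, not the Rook or kubelet implementation. It assumes the liveness check behaves like an HTTP GET with a deadline, where a 2xx/3xx answer within the timeout counts as success. The helper names `probe` and `start_slow_server`, the local test server, and the 2-second delay are all hypothetical stand-ins for a loaded gateway.

```python
import http.server
import threading
import time
import urllib.request


def probe(url, timeout_s):
    """Return True if the endpoint answers with a 2xx/3xx status within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # Covers timeouts, connection errors and HTTP error statuses
        # (urllib's URLError/HTTPError are OSError subclasses).
        return False


def start_slow_server(delay_s):
    """Start a local HTTP server that waits delay_s before answering.

    Hypothetical stand-in for a gateway that is alive but slow under load.
    """

    class SlowHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            time.sleep(delay_s)  # simulate a busy request queue
            try:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok")
            except (BrokenPipeError, ConnectionResetError):
                pass  # the prober already gave up and closed the connection

        def log_message(self, *args):
            pass  # keep the demo output quiet

    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/"


if __name__ == "__main__":
    server, url = start_slow_server(delay_s=2.0)
    print("probe with 1s timeout :", probe(url, 1.0))
    print("probe with 10s timeout:", probe(url, 10.0))
    server.shutdown()
```

With a server that needs about 2 seconds to answer, the 1s probe should report a failure while the 10s probe should pass, even though the server is healthy in both cases. That is the situation described above: the restart is triggered by the probe deadline, not by a dead gateway.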

Checklist:

  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.
  • Comments have been added or updated based on the standards set in CONTRIBUTING.md
  • Add the flag for skipping the CI if this PR does not require a build. See here for more details.

[test ceph]

Signed-off-by: n.fraison <n.fraison@criteo.com>
@ashangit ashangit force-pushed the timeout_liveness_probe_gateway branch from ac18796 to ed619bd on December 12, 2019 09:36

@leseb leseb left a comment

We have seen many attempts like this, a lot of back and forth. This sounds reasonable to me, but I think this is only pushing the problem forward.

Can you share some of the logs from your testing?
Do you see any blocked IO?
Can you share the latency= lines so we can better understand what you observed?

Thanks.

leseb commented Dec 12, 2019

Also #4484 might help.

ashangit (Member, Author) commented

@leseb agreed that tuning the timeout is not that good.
#4484 is a much better approach than mine (I didn't know about this healthcheck endpoint).
Here is a short extract of the latencies observed: https://gist.github.com/ashangit/47fcb02d0d508364a0aad4e59e863d49
For this first test I'm fine with the performance, so I have not pushed the investigation into OSD IO or other blocking points; we reach around 2GB (read + write) with 4 servers (12 OSDs each).
I just wanted to ensure the gateway stays stable even under load, which #4484 seems to provide.

leseb commented Dec 12, 2019

Thanks, if #4484 solves your issue (it should IMHO) please close this. Thanks.

@ashangit ashangit closed this Dec 12, 2019
@ashangit ashangit deleted the timeout_liveness_probe_gateway branch December 12, 2019 18:41