ceph: increase gateway liveness timeout #4482

ashangit · 2019-12-12T09:32:16Z

Description of your changes:
Increase the default timeout of the gateway liveness check from 1s to 10s.

Which issue is resolved by this Pull Request:
Resolves the issue below:
Running a load test (a distcp from Hadoop to S3 of 45 TB with 200 mappers running in parallel) on our Ceph/S3 platform based on Rook 1.1.7, I observed many restarts of the gateway components.
A first issue was due to a bug in Ceph v14.2.4, fixed in v14.2.5 (ceph/ceph#29559).
But most of the restarts are due to the liveness probe timeout, which uses the default setting of 1s.
Under heavy load, the health checks take more than a second to be processed, which leads to the gateway being restarted even though it still responds to queries.
Since I increased the timeout to 10s, this issue no longer happens and the service is stable.

Checklist:

Reviewed the developer guide on Submitting a Pull Request
Documentation has been updated, if necessary.
Unit tests have been added, if necessary.
Integration tests have been added, if necessary.
Pending release notes updated with breaking and/or notable changes, if necessary.
Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
Code generation (make codegen) has been run to update object specifications, if necessary.
Comments have been added or updated based on the standards set in CONTRIBUTING.md
Add the flag for skipping the CI if this PR does not require a build. See here for more details.

[test ceph]

Signed-off-by: n.fraison <n.fraison@criteo.com>

leseb

We have seen many attempts like this, a lot of back and forth. This sounds reasonable to me, but I think this is only pushing the problem forward.

Can you share some of the logs from your testing?
Do you see any blocked I/O?
Can you share the latency= lines, so we can better understand what you observed?

Thanks.

leseb · 2019-12-12T10:59:40Z

Also #4484 might help.

ashangit · 2019-12-12T13:48:15Z

@leseb agreed that tuning the timeout is not that good.
#4484 is a much better approach than mine (I didn't know about this healthcheck endpoint).
Here is a short extract of the latencies observed: https://gist.github.com/ashangit/47fcb02d0d508364a0aad4e59e863d49
On that first test I'm fine with the performance, so I have not pushed the investigation into OSD I/O or other blocking points; it reached around 2GB (read + write) with 4 servers (12 OSDs each).
I just wanted to ensure its stability even under load, which #4484 seems to provide.

leseb · 2019-12-12T15:29:18Z

Thanks. If #4484 solves your issue (it should, IMHO), please close this.

ceph: increase gateway liveness timeout

ed619bd

Signed-off-by: n.fraison <n.fraison@criteo.com>

ashangit force-pushed the timeout_liveness_probe_gateway branch from ac18796 to ed619bd on December 12, 2019 09:36

leseb requested changes Dec 12, 2019

ashangit closed this Dec 12, 2019

ashangit deleted the timeout_liveness_probe_gateway branch December 12, 2019 18:41
