Incrementally improve retry logic in cluster health check #2507

iSignal · 2019-10-03T20:28:07Z

During a discussion with @kmuthukk, the current retry logic in the health check (as a response to #2256 )felt too fragile and biased too much towards ignoring legitimate issues. Retrying legitimate errors for extra time could take a while and cause us to miss our threshold for "fatals in last n mins"

To make this better, here's the proposal

Most health checks are of the form

ssh to node 'run_cmd' and the theory is that they fail due to ssh connectivity issues.

We make them look like

run_health_check:
ssh to node 'echo' # on failure return RETRYABLE_ERR
ssh to node 'run_cmd' # on failure return ERROR
and our overall logic would retry in these retryable cases

cc: @rao-vasireddy @bmatican for thoughts

iSignal added the area/platform Yugabyte Platform label Oct 3, 2019

iSignal self-assigned this Oct 3, 2019

iSignal mentioned this issue Oct 8, 2019

[YW] Retry failed health checks to avoid raising alerts for transient issues #2256

Closed

iSignal mentioned this issue Nov 8, 2019

Alerts and health checks improvements for 2.1 #2894

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incrementally improve retry logic in cluster health check #2507

Incrementally improve retry logic in cluster health check #2507

iSignal commented Oct 3, 2019

Incrementally improve retry logic in cluster health check #2507

Incrementally improve retry logic in cluster health check #2507

Comments

iSignal commented Oct 3, 2019