Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incrementally improve retry logic in cluster health check #2507

Open
iSignal opened this issue Oct 3, 2019 · 0 comments
Open

Incrementally improve retry logic in cluster health check #2507

iSignal opened this issue Oct 3, 2019 · 0 comments
Assignees
Labels
area/platform Yugabyte Platform

Comments

@iSignal
Copy link
Contributor

iSignal commented Oct 3, 2019

During a discussion with @kmuthukk, the current retry logic in the health check (as a response to #2256 )felt too fragile and biased too much towards ignoring legitimate issues. Retrying legitimate errors for extra time could take a while and cause us to miss our threshold for "fatals in last n mins"

To make this better, here's the proposal

Most health checks are of the form

ssh to node 'run_cmd' and the theory is that they fail due to ssh connectivity issues.

We make them look like

run_health_check:
ssh to node 'echo' # on failure return RETRYABLE_ERR
ssh to node 'run_cmd' # on failure return ERROR
and our overall logic would retry in these retryable cases

cc: @rao-vasireddy @bmatican for thoughts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform Yugabyte Platform
Projects
None yet
Development

No branches or pull requests

1 participant