During a discussion with @kmuthukk, the current retry logic in the health check (added in response to #2256) felt too fragile and too biased towards ignoring legitimate issues. Retrying on legitimate errors adds delay and could cause us to miss our threshold for "fatals in last n mins".
To make this better, here's the proposal.
Most health checks are of the form
ssh to node 'run_cmd'
and the theory is that they fail due to ssh connectivity issues.
We make them look like
run_health_check:
  ssh to node 'echo' # on failure return RETRYABLE_ERR
  ssh to node 'run_cmd' # on failure return ERROR
and our overall logic would retry only in these retryable cases.
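To make the proposal concrete, here is a rough Python sketch of the two-step check and the retry loop around it. All names (`CheckResult`, `ssh_runner`, `run_health_check`, `check_with_retries`) and the attempt count, delay, and timeout values are hypothetical and only for illustration; this is not the existing health check code.

```python
import enum
import subprocess
import time
from typing import Callable


class CheckResult(enum.Enum):
    OK = "OK"
    RETRYABLE_ERR = "RETRYABLE_ERR"  # could not reach the node at all
    ERROR = "ERROR"                  # node reachable, but the check itself failed


# Hypothetical runner signature: (node, command) -> exit code.
Runner = Callable[[str, str], int]


def ssh_runner(node: str, cmd: str) -> int:
    """Run cmd on node through the ssh CLI and return its exit code."""
    try:
        return subprocess.run(
            ["ssh", node, cmd], capture_output=True, timeout=30
        ).returncode
    except subprocess.TimeoutExpired:
        # Assumption: treat a hung ssh like any other failure (nonzero code).
        return 255


def run_health_check(node: str, cmd: str, run: Runner = ssh_runner) -> CheckResult:
    # Step 1: cheap connectivity probe. A failure here says nothing about
    # the node's health, so it is the only retryable outcome.
    if run(node, "echo") != 0:
        return CheckResult.RETRYABLE_ERR
    # Step 2: the real check. Connectivity was just confirmed, so a failure
    # is treated as a legitimate error and reported immediately.
    if run(node, cmd) != 0:
        return CheckResult.ERROR
    return CheckResult.OK


def check_with_retries(
    node: str,
    cmd: str,
    run: Runner = ssh_runner,
    max_attempts: int = 3,
    delay_secs: float = 0.0,
) -> CheckResult:
    """Retry only while the result is RETRYABLE_ERR; never retry ERROR."""
    result = CheckResult.RETRYABLE_ERR
    for attempt in range(max_attempts):
        result = run_health_check(node, cmd, run)
        if result is not CheckResult.RETRYABLE_ERR:
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay_secs)
    return result
```

One caveat with this shape: connectivity can still drop between the 'echo' probe and 'run_cmd', in which case a connectivity failure would be reported as ERROR. The sketch accepts that in exchange for never delaying a legitimate error.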
cc: @rao-vasireddy @bmatican for thoughts