In various tests, we consistently observe continuous unavailability when injecting IO latency and other chaos into the PD leader node, as in #6291. Upon further investigation of the logs, we discovered that the detection and eviction of unhealthy nodes are not always accurate. As a result, problematic etcd nodes can persistently impact our requests due to the round-robin balancer used by the etcd client. Especially during a leader switch, this problem can prevent the PD leader from stabilizing and prolong the election, which significantly affects availability.
Currently, the health check uses a fixed 10-second timeout, which is too loose in some cases. We need a more precise detection mechanism that promptly removes an unhealthy etcd node from the available endpoints and prevents it from rejoining before it truly recovers.
…#7737)
ref #7730
Consider the latency while patrolling the healthy endpoints to reduce the effect of slow nodes.
The following strategies are now used to select and remove unhealthy endpoints (see the sketch after this list):
- Choose only the healthy endpoints within the lowest acceptable latency range.
- An evicted endpoint can only rejoin after it has been selected again for three consecutive rounds.
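A minimal Go sketch of how such latency-based selection and consecutive-rejoin counting could work. All identifiers here (`healthProbe`, `pickHealthyEndpoints`, `rejoinCounter`, `acceptableLatencyFactor`, `rejoinThreshold`) and the latency-factor threshold are illustrative assumptions, not PD's actual implementation:

```go
package healthchecker

import (
	"sort"
	"time"
)

// healthProbe records the observed round-trip latency of one etcd endpoint.
type healthProbe struct {
	endpoint string
	healthy  bool
	latency  time.Duration
}

const (
	// Endpoints slower than the fastest healthy endpoint by more than this
	// factor are treated as unacceptable (assumed value for illustration).
	acceptableLatencyFactor = 2.0
	// An evicted endpoint must be picked this many consecutive patrol rounds
	// before it is allowed back into the active endpoint set.
	rejoinThreshold = 3
)

// pickHealthyEndpoints keeps only the healthy endpoints whose latency falls
// within the lowest acceptable latency range.
func pickHealthyEndpoints(probes []healthProbe) []string {
	var healthy []healthProbe
	for _, p := range probes {
		if p.healthy {
			healthy = append(healthy, p)
		}
	}
	if len(healthy) == 0 {
		return nil
	}
	sort.Slice(healthy, func(i, j int) bool { return healthy[i].latency < healthy[j].latency })
	limit := time.Duration(float64(healthy[0].latency) * acceptableLatencyFactor)
	var picked []string
	for _, p := range healthy {
		if p.latency <= limit {
			picked = append(picked, p.endpoint)
		}
	}
	return picked
}

// rejoinCounter tracks how many consecutive patrol rounds an evicted endpoint
// has been picked; it may rejoin only after rejoinThreshold consecutive picks.
type rejoinCounter map[string]int

func (c rejoinCounter) observe(endpoint string, picked bool) (canRejoin bool) {
	if !picked {
		// Any miss resets the streak, so a flapping node cannot rejoin.
		c[endpoint] = 0
		return false
	}
	c[endpoint]++
	return c[endpoint] >= rejoinThreshold
}
```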
Signed-off-by: JmPotato <ghzpotato@gmail.com>
ref #7499, ref #7730
Return the originally picked endpoints directly if all are evicted to gain better availability.
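A minimal sketch of that fallback, assuming a hypothetical `filterEndpoints` helper: if every picked endpoint has been evicted, it returns the original pick instead of an empty list, so the etcd client is never left with zero endpoints.

```go
package healthchecker

// filterEndpoints drops the evicted endpoints from the picked set, but falls
// back to the originally picked endpoints when all of them have been evicted,
// trading accuracy for availability on degraded clusters.
func filterEndpoints(picked []string, evicted map[string]struct{}) []string {
	var remaining []string
	for _, ep := range picked {
		if _, ok := evicted[ep]; !ok {
			remaining = append(remaining, ep)
		}
	}
	if len(remaining) == 0 {
		return picked
	}
	return remaining
}
```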
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Part of #7499.