In various tests, we consistently observe continuous unavailability when injecting IO latency and other chaos into the PD leader node, as in #6291. Upon further investigation of the logs, we discovered that the detection and eviction of unhealthy nodes are not always accurate. As a result, problematic etcd nodes can persistently impact our requests due to the round-robin balancer used by the etcd client. Especially during a leader switch, this problem can prevent the PD leader from stabilizing and prolong the election, which significantly affects availability.
Currently, the health check uses a fixed 10-second timeout, which is too loose in some cases. We need a more precise detection mechanism that promptly removes an unhealthy etcd node from the available endpoints and prevents it from rejoining before it truly recovers.
…#7737)
ref #7730
Consider the latency while patrolling the healthy endpoints to reduce the effect of slow nodes.
The following strategies are now used to select and remove unhealthy endpoints (see the sketch after this list):
- Choose only the healthy endpoints within the lowest acceptable latency range.
- An evicted endpoint can only rejoin after it has been selected again for three consecutive rounds.
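A minimal Go sketch of how such latency-based selection and consecutive-rejoin counting could work. All identifiers here (`healthProbe`, `pickHealthyEndpoints`, `rejoinCounter`, `acceptableLatencyFactor`, `rejoinThreshold`) and the latency-factor threshold are illustrative assumptions, not PD's actual implementation:

```go
package healthchecker

import (
	"sort"
	"time"
)

// healthProbe records the observed round-trip latency of one etcd endpoint.
type healthProbe struct {
	endpoint string
	healthy  bool
	latency  time.Duration
}

const (
	// Endpoints slower than the fastest healthy endpoint by more than this
	// factor are treated as unacceptable (assumed value for illustration).
	acceptableLatencyFactor = 2.0
	// An evicted endpoint must be picked this many consecutive patrol rounds
	// before it is allowed back into the active endpoint set.
	rejoinThreshold = 3
)

// pickHealthyEndpoints keeps only the healthy endpoints whose latency falls
// within the lowest acceptable latency range.
func pickHealthyEndpoints(probes []healthProbe) []string {
	var healthy []healthProbe
	for _, p := range probes {
		if p.healthy {
			healthy = append(healthy, p)
		}
	}
	if len(healthy) == 0 {
		return nil
	}
	sort.Slice(healthy, func(i, j int) bool { return healthy[i].latency < healthy[j].latency })
	limit := time.Duration(float64(healthy[0].latency) * acceptableLatencyFactor)
	var picked []string
	for _, p := range healthy {
		if p.latency <= limit {
			picked = append(picked, p.endpoint)
		}
	}
	return picked
}

// rejoinCounter tracks how many consecutive patrol rounds an evicted endpoint
// has been picked; it may rejoin only after rejoinThreshold consecutive picks.
type rejoinCounter map[string]int

func (c rejoinCounter) observe(endpoint string, picked bool) (canRejoin bool) {
	if !picked {
		// Any miss resets the streak, so a flapping node cannot rejoin.
		c[endpoint] = 0
		return false
	}
	c[endpoint]++
	return c[endpoint] >= rejoinThreshold
}
```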
Signed-off-by: JmPotato <ghzpotato@gmail.com>
ref #7499, ref #7730
Return the originally picked endpoints directly if all are evicted to gain better availability.
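A minimal sketch of that fallback, assuming a hypothetical `filterEndpoints` helper: if every picked endpoint has been evicted, it returns the original pick instead of an empty list, so the etcd client is never left with zero endpoints.

```go
package healthchecker

// filterEndpoints drops the evicted endpoints from the picked set, but falls
// back to the originally picked endpoints when all of them have been evicted,
// trading accuracy for availability on degraded clusters.
func filterEndpoints(picked []string, evicted map[string]struct{}) []string {
	var remaining []string
	for _, ep := range picked {
		if _, ok := evicted[ep]; !ok {
			remaining = append(remaining, ep)
		}
	}
	if len(remaining) == 0 {
		return picked
	}
	return remaining
}
```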
Signed-off-by: JmPotato <ghzpotato@gmail.com>
Part of #7499.