Description
What would you like to be added?
Background
gRPC health checks are used to probe whether a server is able to handle RPCs. A server may choose to reply “unhealthy” because it is not ready to take requests, because it is shutting down, or for some other reason. The client can act accordingly if no response is received within some time window or if the response reports the server as unhealthy.
ref.
- https://github.com/grpc/grpc/blob/master/doc/health-checking.md
- https://github.com/grpc/proposal/blob/master/A17-client-side-health-checking.md
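To make the client-side rule above concrete, here is a minimal sketch in plain Go. It treats an endpoint as healthy only if the probe answers "serving" before a deadline. `CheckFunc`, `IsHealthy`, `probeTimeout`, and the three sample probes are hypothetical names made up for this illustration. They stand in for a real health `Check` RPC and are not part of the etcd or gRPC APIs.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ServingStatus is a simplified stand-in for the status a health probe returns.
type ServingStatus int

const (
	Unknown ServingStatus = iota
	Serving
	NotServing
)

// CheckFunc is a hypothetical stand-in for a health Check RPC against one endpoint.
type CheckFunc func(ctx context.Context) (ServingStatus, error)

// probeTimeout is an arbitrary time window chosen for this example.
const probeTimeout = 50 * time.Millisecond

// IsHealthy runs a single probe with a deadline. The endpoint counts as
// healthy only if the probe returns Serving without error before the deadline.
func IsHealthy(check CheckFunc, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	status, err := check(ctx)
	return err == nil && status == Serving
}

// servingCheck models a server that can handle RPCs.
func servingCheck(ctx context.Context) (ServingStatus, error) {
	return Serving, nil
}

// notServingCheck models a server that replies "unhealthy",
// for example because it is shutting down.
func notServingCheck(ctx context.Context) (ServingStatus, error) {
	return NotServing, nil
}

// stuckCheck models a server that never answers within the time window.
func stuckCheck(ctx context.Context) (ServingStatus, error) {
	<-ctx.Done()
	return Unknown, ctx.Err()
}

func main() {
	fmt.Println("serving:", IsHealthy(servingCheck, probeTimeout))
	fmt.Println("not serving:", IsHealthy(notServingCheck, probeTimeout))
	fmt.Println("no reply in time:", IsHealthy(stuckCheck, probeTimeout))
}
```

Both an explicit "unhealthy" reply and a missing reply lead to the same decision here. That matches the two cases in the background text, but a real design might want to tell them apart.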
#8121 added a basic gRPC health service on the server side only, available since etcd v3.3.
etcd/server/etcdserver/api/v3rpc/grpc.go
Lines 81 to 86 in 6220174
Problem
In a scenario with multiple etcd server endpoints, the etcd client only fails over to another endpoint when the existing connection/channel is not in the Ready state. However, the etcd client does not know whether the etcd server can actually handle RPCs.
For example:
- defrag stops the process completely for noticeable duration #9222
- Removed etcd member failed to stop on stuck disk #14338
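The gap can be illustrated with a simplified model in plain Go. This is not the etcd client's actual balancer code. `Endpoint`, `PickByConnectivity`, `PickByHealth`, and the sample addresses are hypothetical, and the `Serving` field assumes that some health probe result is available to the client, which is exactly what this issue proposes to add.

```go
package main

import "fmt"

// ConnState is a simplified stand-in for a gRPC channel connectivity state.
type ConnState int

const (
	Idle ConnState = iota
	Connecting
	Ready
	TransientFailure
)

// Endpoint is a hypothetical view of one etcd endpoint as seen by the client.
type Endpoint struct {
	Addr    string
	State   ConnState
	Serving bool // assumed result of the latest health probe
}

// sampleEndpoints models the problem: etcd-1 keeps its connection Ready
// but cannot handle RPCs (for example during defrag or on a stuck disk).
var sampleEndpoints = []Endpoint{
	{Addr: "etcd-1:2379", State: Ready, Serving: false},
	{Addr: "etcd-2:2379", State: Ready, Serving: true},
	{Addr: "etcd-3:2379", State: TransientFailure, Serving: false},
}

// PickByConnectivity returns the first endpoint whose connection is Ready,
// ignoring whether the server behind it can handle RPCs.
func PickByConnectivity(eps []Endpoint) (string, bool) {
	for _, ep := range eps {
		if ep.State == Ready {
			return ep.Addr, true
		}
	}
	return "", false
}

// PickByHealth returns the first endpoint that is both Ready and serving.
func PickByHealth(eps []Endpoint) (string, bool) {
	for _, ep := range eps {
		if ep.State == Ready && ep.Serving {
			return ep.Addr, true
		}
	}
	return "", false
}

func main() {
	addr, _ := PickByConnectivity(sampleEndpoints)
	fmt.Println("connectivity only:", addr)
	addr, _ = PickByHealth(sampleEndpoints)
	fmt.Println("health aware:", addr)
}
```

With connectivity alone, the client stays on the endpoint that cannot serve requests. With a health signal, it can move to the next Ready endpoint.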
This needs comprehensive design and testing.
A placeholder Google doc, "etcd client grpc health check", has been copied from the KEP template.
Why is this needed?
Improve etcd availability by letting the client fail over to other healthy etcd endpoints.