Fix stale internode gRPC connections after pod termination #9004
Conversation
👋 The history client has an internal gRPC connection pool, see client/history/connections.go. That predates your work to add a cache inside the RPC factory #9004. Does the history client connection pool need to be removed to work with the changes in this PR?
Hi @alfred-landrum, thanks for pointing this out! I just checked how the history cache works. I see that all caches share the same *grpc.ClientConn pointer. When HandleMembershipChange() in RPCFactory calls conn.Close(), it closes the actual gRPC connection object that all caches reference. After this, operations on the connection should return an Unavailable error instead of a timeout, and the caching redirector should evict the cache entry at that point (temporal/client/history/caching_redirector.go, line 108 at b6415c6).
I also see that the caching redirector listens for membership change events (temporal/client/history/caching_redirector.go, line 244 at b6415c6), but it waits for the staleTTL duration before removing the cache entry. What is the reason for this wait?
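To make the shared-pointer behavior described above concrete, here is a minimal, self-contained sketch (not Temporal's code; connCache, the address, and the method name are made up). Once the single *grpc.ClientConn is closed, every cache entry holding it fails; note that depending on the grpc-go version, a closed connection can surface as Canceled rather than Unavailable, so the eviction check below covers both.

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"
)

func main() {
	// Two "cache entries" sharing one connection, as the history client
	// caches do. grpc.NewClient is lazy, so no server is needed to run this.
	conn, err := grpc.NewClient("localhost:7234",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	connCache := map[string]*grpc.ClientConn{"shard-1": conn, "shard-2": conn}

	// What HandleMembershipChange() effectively does to the shared object.
	conn.Close()

	// Any RPC through either entry now fails immediately instead of dialing.
	err = connCache["shard-2"].Invoke(context.Background(), "/svc/Method", nil, nil)
	if s, ok := status.FromError(err); ok &&
		(s.Code() == codes.Unavailable || s.Code() == codes.Canceled) {
		delete(connCache, "shard-2") // evict the stale entry
	}
	fmt.Println("post-close error:", err)
}
```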
It's to reduce latency for operations during shard movements. When the history service is told to stop, it leaves membership as a first step, then, depending on config, waits a bit for connections and shard ownership to move. Even though it has left membership, the shards it owned may not yet have been acquired by their new owners. By using the cached entry for a small duration, we give the new owner time to come up and initialize shards. (The new owner will then fence out requests from the old owner, so any requests to the old owner return shard ownership lost.)
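To illustrate that wait, here is a hedged sketch of a staleTTL-style deferred eviction; redirectorCache, onMembershipChange, and get are hypothetical names, not the actual fields of caching_redirector.go. Entries for departed hosts are only marked stale and keep serving until the TTL elapses, giving the new shard owner time to initialize.

```go
package cache

import (
	"sync"
	"time"

	"google.golang.org/grpc"
)

type cacheEntry struct {
	conn    *grpc.ClientConn
	staleAt time.Time // zero until the host leaves membership
}

type redirectorCache struct {
	mu       sync.Mutex
	staleTTL time.Duration
	entries  map[string]*cacheEntry
}

// onMembershipChange marks entries for departed hosts stale instead of
// dropping them outright, so in-flight traffic keeps working during the move.
func (c *redirectorCache) onMembershipChange(departed []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for _, host := range departed {
		if e, ok := c.entries[host]; ok && e.staleAt.IsZero() {
			e.staleAt = now
		}
	}
}

// get returns a cached connection, evicting it only once staleTTL has passed.
func (c *redirectorCache) get(host string) (*grpc.ClientConn, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[host]
	if !ok {
		return nil, false
	}
	if !e.staleAt.IsZero() && time.Since(e.staleAt) > c.staleTTL {
		delete(c.entries, host) // TTL elapsed; stop routing to the old owner
		return nil, false
	}
	return e.conn, true
}
```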
What changed?
When Kubernetes pods are terminated, the cached gRPC connections in interNodeGrpcConnections remain stale, causing continuous "dial tcp ... i/o timeout" errors until the service is restarted.
This fix adds a membership listener to RPCFactory that:
- subscribes to membership change events, and
- when a host leaves membership, closes its connection and removes it from interNodeGrpcConnections, so the next request dials the current pod IP instead of retrying a stale one (see the sketch below).
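Under assumed names (the simplified rpcFactory struct and the []string event shape below are stand-ins; only HandleMembershipChange and interNodeGrpcConnections come from the discussion above), the core of that listener looks roughly like:

```go
package rpc

import (
	"sync"

	"google.golang.org/grpc"
)

type rpcFactory struct {
	mu                       sync.Mutex
	interNodeGrpcConnections map[string]*grpc.ClientConn // keyed by host:port
}

// HandleMembershipChange closes and evicts connections to hosts that have
// left membership, so the next request dials the new pod IP instead of
// repeatedly timing out against the terminated one.
func (f *rpcFactory) HandleMembershipChange(departedHosts []string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, host := range departedHosts {
		if conn, ok := f.interNodeGrpcConnections[host]; ok {
			_ = conn.Close() // fail fast for in-flight calls on this conn
			delete(f.interNodeGrpcConnections, host)
		}
	}
}
```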
Why?
To avoid repeated dial attempts to stale pod IPs
How did you test it?