
Conversation

@prathyushpv (Contributor) commented Jan 12, 2026

What changed?

When Kubernetes pods are terminated, the cached gRPC connections in interNodeGrpcConnections remain stale, causing continuous "dial tcp ... i/o timeout" errors until the service is restarted.

This fix adds a membership listener to RPCFactory that:

  • Subscribes to membership ring change events for the history service
  • Evicts cached connections when hosts are removed from the ring
  • Properly closes evicted connections via the cache's RemovedFunc callback (see the sketch below)
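
A rough sketch of the shape of the change (illustrative only, not the actual Temporal code; `ChangedEvent`, `conn`, and `connCache` below are simplified stand-ins for the membership change event, `*grpc.ClientConn`, and the `interNodeGrpcConnections` cache):

```go
// Illustrative sketch of the listener/eviction pattern, not the real code.
package main

import (
	"fmt"
	"sync"
)

// ChangedEvent carries the hosts that left the history service ring.
type ChangedEvent struct {
	HostsRemoved []string
}

// conn stands in for *grpc.ClientConn.
type conn struct{ addr string }

func (c *conn) Close() error {
	fmt.Println("closing connection to", c.addr)
	return nil
}

// connCache stands in for the RPCFactory connection cache; closing on
// eviction plays the role of the cache RemovedFunc callback.
type connCache struct {
	mu    sync.Mutex
	conns map[string]*conn
}

func (c *connCache) evict(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cc, ok := c.conns[addr]; ok {
		delete(c.conns, addr)
		_ = cc.Close() // close the evicted connection
	}
}

// handleMembershipChange evicts cached connections for removed hosts, so the
// next request dials the new pod instead of timing out against a dead IP.
func handleMembershipChange(cache *connCache, ev ChangedEvent) {
	for _, host := range ev.HostsRemoved {
		cache.evict(host)
	}
}

func main() {
	cache := &connCache{conns: map[string]*conn{
		"10.0.0.5:7234": {addr: "10.0.0.5:7234"},
	}}
	handleMembershipChange(cache, ChangedEvent{HostsRemoved: []string{"10.0.0.5:7234"}})
}
```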

Why?

To avoid repeated dial attempts to stale pod IPs.

How did you test it?

  • built
  • ran locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

@prathyushpv prathyushpv marked this pull request as ready for review January 12, 2026 19:59
@prathyushpv prathyushpv requested review from a team as code owners January 12, 2026 19:59
@alfred-landrum (Contributor) commented
👋 The history client has an internal gRPC connection pool, see client/history/connections.go. That predates your work in #9004 to add a cache inside the RPC factory. Does the history client connection pool need to be removed to work with the changes in this PR?

@prathyushpv (Contributor, Author) commented
Hi @alfred-landrum, thanks for pointing this out! I just checked how the history client cache works. I see that all caches share the same *grpc.ClientConn pointer. When HandleMembershipChange() in RPCFactory calls conn.Close(), it closes the actual gRPC connection object that all caches reference. After this, operations on this connection should return an Unavailable error instead of a timeout error. The caching redirector should evict the cache entry at this point:

if maybeHostDownError(opErr) {
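
To make the shared-pointer point concrete, here is a small illustrative snippet (not Temporal code; the map names and address are made up) showing that closing the connection through one reference also shuts down the copy held elsewhere:

```go
// Illustrative only: two "caches" holding the same *grpc.ClientConn pointer.
// Closing it through one reference shuts down the object both reference.
package main

import (
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// One physical connection object (created lazily; no I/O happens here).
	cc, err := grpc.NewClient("dns:///history-host:7234",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}

	// ...referenced from two places, standing in for the RPCFactory cache
	// and the history client connection pool.
	rpcFactoryCache := map[string]*grpc.ClientConn{"history-host:7234": cc}
	historyClientPool := map[string]*grpc.ClientConn{"history-host:7234": cc}

	// Closing via one reference closes the shared object; RPCs issued through
	// the other reference now fail fast instead of timing out.
	_ = rpcFactoryCache["history-host:7234"].Close()
	fmt.Println(historyClientPool["history-host:7234"].GetState()) // prints SHUTDOWN
}
```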

I see that the caching redirector also listens for membership change events here:

func (r *CachingRedirector[C]) staleCheck() {

But it waits for the staleTTL duration before removing the cache entry. What is the reason for this wait?

@alfred-landrum (Contributor) commented
But it waits for the staleTTL duration before removing the cache entry. What is the reason for this wait?

It's to reduce latency for operations during shard movements. When the history service is told to stop, it leaves membership as a first step and then, depending on config, waits a bit for connections and shard ownership to move. Even though it has left membership, the shards it owned may not yet have been acquired by their new owners. By using the cached entry for a small duration, we give the new owner time to come up and initialize its shards. (It will then fence out requests from the old owner, so any requests to the old owner will return shard ownership lost.)
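
Roughly, the stale check behaves like this (an illustrative sketch, not the real CachingRedirector code; `cachedEntry` and `staleCheck` are made-up names):

```go
// Illustrative sketch of the stale-TTL idea: an entry whose host has left
// membership is kept for staleTTL before being dropped, giving the new shard
// owner time to come up and acquire the shards.
package main

import (
	"fmt"
	"time"
)

type cachedEntry struct {
	addr       string
	staleSince time.Time // zero value: host is still in membership
}

// staleCheck drops entries whose host left membership more than staleTTL ago.
func staleCheck(entries map[string]*cachedEntry, staleTTL time.Duration, now time.Time) {
	for key, e := range entries {
		if !e.staleSince.IsZero() && now.Sub(e.staleSince) > staleTTL {
			delete(entries, key)
		}
	}
}

func main() {
	staleTTL := 5 * time.Second
	entries := map[string]*cachedEntry{
		"shard-7": {addr: "10.0.0.5:7234", staleSince: time.Now().Add(-10 * time.Second)},
		"shard-8": {addr: "10.0.0.6:7234"}, // host still in membership
	}
	staleCheck(entries, staleTTL, time.Now())
	fmt.Println(len(entries)) // 1: only the long-stale entry was dropped
}
```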
