
Conversation

@prathyushpv (Contributor) commented Jan 12, 2026

What changed?

When Kubernetes pods are terminated, the cached gRPC connections in interNodeGrpcConnections remain stale, causing continuous "dial tcp ... i/o timeout" errors until the service is restarted.

This fix adds a membership listener to RPCFactory that:

  • Subscribes to membership ring change events for the history service
  • Evicts cached connections when hosts are removed from the ring
  • Properly closes evicted connections via the cache's RemovedFunc callback (see the sketch below)
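
A rough sketch of the shape of the change (illustrative only, not the actual Temporal code; `ChangedEvent`, `conn`, and `connCache` below are simplified stand-ins for the membership change event, `*grpc.ClientConn`, and the `interNodeGrpcConnections` cache):

```go
// Illustrative sketch of the listener/eviction pattern, not the real code.
package main

import (
	"fmt"
	"sync"
)

// ChangedEvent carries the hosts that left the history service ring.
type ChangedEvent struct {
	HostsRemoved []string
}

// conn stands in for *grpc.ClientConn.
type conn struct{ addr string }

func (c *conn) Close() error {
	fmt.Println("closing connection to", c.addr)
	return nil
}

// connCache stands in for the RPCFactory connection cache; closing on
// eviction plays the role of the cache RemovedFunc callback.
type connCache struct {
	mu    sync.Mutex
	conns map[string]*conn
}

func (c *connCache) evict(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cc, ok := c.conns[addr]; ok {
		delete(c.conns, addr)
		_ = cc.Close() // close the evicted connection
	}
}

// handleMembershipChange evicts cached connections for removed hosts, so the
// next request dials the new pod instead of timing out against a dead IP.
func handleMembershipChange(cache *connCache, ev ChangedEvent) {
	for _, host := range ev.HostsRemoved {
		cache.evict(host)
	}
}

func main() {
	cache := &connCache{conns: map[string]*conn{
		"10.0.0.5:7234": {addr: "10.0.0.5:7234"},
	}}
	handleMembershipChange(cache, ChangedEvent{HostsRemoved: []string{"10.0.0.5:7234"}})
}
```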

Why?

To avoid repeated dial attempts to stale pod IPs.

How did you test it?

  • built
  • ran locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

@prathyushpv prathyushpv marked this pull request as ready for review January 12, 2026 19:59
@prathyushpv prathyushpv requested review from a team as code owners January 12, 2026 19:59
@alfred-landrum (Contributor) commented
👋 The history client has an internal gRPC connection pool, see client/history/connections.go. That predates your work in #9004 to add a cache inside the RPC factory. Does the history client connection pool need to be removed to work with the changes in this PR?

@prathyushpv (Contributor, Author) commented
Hi @alfred-landrum, thanks for pointing this out! I just checked how the history client cache works. I see that all caches share the same *grpc.ClientConn pointer. When HandleMembershipChange() in RPCFactory calls conn.Close(), it closes the actual gRPC connection object that all caches reference. After this, operations on this connection should return an Unavailable error instead of a timeout error. The caching redirector should evict the cache entry at this point:

if maybeHostDownError(opErr) {
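
To make the shared-pointer point concrete, here is a small illustrative snippet (not Temporal code; the map names and address are made up) showing that closing the connection through one reference also shuts down the copy held elsewhere:

```go
// Illustrative only: two "caches" holding the same *grpc.ClientConn pointer.
// Closing it through one reference shuts down the object both reference.
package main

import (
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// One physical connection object (created lazily; no I/O happens here).
	cc, err := grpc.NewClient("dns:///history-host:7234",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}

	// ...referenced from two places, standing in for the RPCFactory cache
	// and the history client connection pool.
	rpcFactoryCache := map[string]*grpc.ClientConn{"history-host:7234": cc}
	historyClientPool := map[string]*grpc.ClientConn{"history-host:7234": cc}

	// Closing via one reference closes the shared object; RPCs issued through
	// the other reference now fail fast instead of timing out.
	_ = rpcFactoryCache["history-host:7234"].Close()
	fmt.Println(historyClientPool["history-host:7234"].GetState()) // prints SHUTDOWN
}
```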

I see that the caching redirector also listens for membership change events here:

func (r *CachingRedirector[C]) staleCheck() {

But it waits for the staleTTL duration before removing the cache entry. What is the reason for this wait?

@alfred-landrum (Contributor) commented
But it waits for the staleTTL duration before removing the cache entry. What is the reason for this wait?

It's to reduce latency for operations during shard movements. When the history service is told to stop, it leaves membership as a first step and then, depending on config, waits a bit for connections and shard ownership to move. Even though it has left membership, the shards it owned may not yet have been acquired by their new owners. By using the cached entry for a small duration, we give the new owner time to come up and initialize its shards. (It will then fence out requests from the old owner, so any requests to the old owner will return shard ownership lost.)
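
Roughly, the stale check behaves like this (an illustrative sketch, not the real CachingRedirector code; `cachedEntry` and `staleCheck` are made-up names):

```go
// Illustrative sketch of the stale-TTL idea: an entry whose host has left
// membership is kept for staleTTL before being dropped, giving the new shard
// owner time to come up and acquire the shards.
package main

import (
	"fmt"
	"time"
)

type cachedEntry struct {
	addr       string
	staleSince time.Time // zero value: host is still in membership
}

// staleCheck drops entries whose host left membership more than staleTTL ago.
func staleCheck(entries map[string]*cachedEntry, staleTTL time.Duration, now time.Time) {
	for key, e := range entries {
		if !e.staleSince.IsZero() && now.Sub(e.staleSince) > staleTTL {
			delete(entries, key)
		}
	}
}

func main() {
	staleTTL := 5 * time.Second
	entries := map[string]*cachedEntry{
		"shard-7": {addr: "10.0.0.5:7234", staleSince: time.Now().Add(-10 * time.Second)},
		"shard-8": {addr: "10.0.0.6:7234"}, // host still in membership
	}
	staleCheck(entries, staleTTL, time.Now())
	fmt.Println(len(entries)) // 1: only the long-stale entry was dropped
}
```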
