v2.11.24

@rsafonseca rsafonseca released this 18 Jun 15:10

Bug fix: "send on closed channel" panic during shutdown in the NATS connection handler

Fixes a race condition that could crash a Pitaya server with panic: send on closed channel during shutdown.

Cause

The app's dieChan is a single shared channel that is closed by (*App).Shutdown but sent to by NATS' ClosedHandler (setupNatsConn, cluster/nats_rpc_common.go). When a NATS connection drops with an error concurrently with shutdown, the handler runs on nats.go's async callback dispatcher and executes appDieChan <- true after Shutdown has already closed the channel — and a send on a closed channel panics, even inside a select.

The handler early-returns on a clean close (LastError() == nil), so this only triggers when NATS becomes unreachable with an error at the same time the server is terminating (e.g. a NATS rollout or network blip during a pod shutdown) — which is why it was intermittent.
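The underlying Go behavior can be reproduced in isolation. The following is a minimal sketch, not Pitaya code: sendInSelect is a hypothetical helper that attempts a non-blocking send inside a select and reports whether the send panicked. It shows that a default branch does not protect a send on a closed channel.

```go
package main

import "fmt"

// sendInSelect is a hypothetical helper (not part of Pitaya). It attempts a
// non-blocking send on ch and reports whether that send panicked.
func sendInSelect(ch chan bool) (panicked bool) {
	defer func() {
		// A send on a closed channel raises a run-time panic; catch it here
		// only so the demo can report it instead of crashing.
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	select {
	case ch <- true:
	default:
		// Reached only when the channel is open and the send would block.
	}
	return false
}

func main() {
	dieChan := make(chan bool)
	close(dieChan) // stands in for Shutdown closing the shared channel

	// The select does not fall through to default: the send case is chosen
	// and panics because the channel is closed.
	fmt.Println("panicked:", sendInSelect(dieChan))
}
```

Without the deferred recover, the same send would terminate the process, which matches the crash described above.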

Impact

The panic is raised on the async dispatcher goroutine and is unrecovered, so it terminates the whole process immediately. That aborts the rest of graceful shutdown, including etcd service-discovery deregistration — leaving a ghost entry in service discovery until its lease TTL expires, during which healthy peers may keep routing RPCs to the dead server.

Fix

The appDieChan notification in the ClosedHandler is now guarded: if the channel has already been closed (the app is already terminating and needs no signal), the resulting panic is recovered and the handler returns quietly. All other paths (initial-connect error propagation, process-signal fallback) are unchanged. Includes a regression test that reproduces the panic against a closed dieChan.
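The guard described above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from the commit: notifyDie is a hypothetical name, and the non-blocking select is assumed rather than taken from the Pitaya source.

```go
package main

import "fmt"

// notifyDie is a hypothetical helper illustrating a recover-guarded send.
// It returns true if the signal was delivered, and false if the channel was
// already closed or no receiver/buffer slot was available.
func notifyDie(dieChan chan bool) (delivered bool) {
	defer func() {
		// If dieChan was closed concurrently (the app is already shutting
		// down), the send panics. Recover and return quietly, since no
		// signal is needed in that case.
		if r := recover(); r != nil {
			delivered = false
		}
	}()
	select {
	case dieChan <- true:
		return true
	default:
		return false
	}
}

func main() {
	open := make(chan bool, 1)
	fmt.Println(notifyDie(open)) // true: the buffered send succeeds

	closed := make(chan bool)
	close(closed)
	fmt.Println(notifyDie(closed)) // false: the panic is recovered
}
```

Recovering inside the helper keeps the guard local to the one send that can race with Shutdown. Note that this sketch swallows any panic raised inside the helper, which is acceptable here only because the body contains nothing but the send.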

Commit: 45717a2