Wait for blocksync goroutines on Stop to fix leveldb shutdown panic #3415
Conversation
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
Codecov Report ❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3415      +/-   ##
==========================================
+ Coverage   59.28%   59.32%   +0.04%
==========================================
  Files        2120     2120
  Lines      175516   175556      +40
==========================================
+ Hits       104053   104156     +103
+ Misses      62393    62324      -69
- Partials     9070     9076       +6
Flags with carried forward coverage won't be shown.
Marked back as draft to take a closer look at reactor code before opening back up for review.
Reactor.OnStart, Reactor.SwitchToBlockSync, BlockPool.OnStart, and the auto-restart spawn inside poolRoutine all started their long-running goroutines with raw `go fn(ctx)` using the outer context. They were therefore not registered with the BaseService WaitGroup, and Stop() never waited for them. The outer ctx also outlived Stop, so the goroutines kept running after Stop returned.

During node shutdown this raced nodeImpl.OnStop's blockStore.Close(): poolRoutine, still inside SaveBlock -> Base() -> bs.db.Iterator, observed its leveldb table reader released and panicked with "leveldb/table: reader released". autoRestartIfBehind, which also reads from the BlockStore, has the same race.

Mirror consensus.Reactor's SwitchToConsensus pattern: pre-spawn all long-running routines in OnStart through BaseService.Spawn, gate the conditional ones (requestRoutine, poolRoutine, autoRestartIfBehind) on utils.AtomicSend[bool] signals, and have SwitchToBlockSync and the blocksync->consensus handoff trigger those signals instead of spawning fresh goroutines. Stop() now cancels every blocksync goroutine via inner.ctx and blocks on inner.wg until they exit, which happens before the node closes the BlockStore DB. SwitchToConsensus still receives the node-scoped ctx (captured at OnStart) so the consensus reactor's handoff is not affected by blocksync's own cancellation.

Add a regression test that asserts no blocksync goroutines remain after Reactor.Stop() returns.
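For illustration, a minimal, self-contained sketch of the pre-spawn-and-gate pattern described above. The `Spawn` helper, the `blocksyncReady` channel, and the `poolRoutine` stub are stand-ins for this repo's BaseService.Spawn, the utils.AtomicSend[bool] signals, and the real blocksync routines; none of it is the PR's actual code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// service sketches the relevant parts of a BaseService-like wrapper: Spawn
// tracks goroutines in a WaitGroup and ties them to an internal context that
// Stop cancels, so Stop can block until every routine has exited.
type service struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup

	blocksyncReady chan struct{} // one-shot gate, standing in for utils.AtomicSend[bool]
	readyOnce      sync.Once
}

func newService() *service {
	ctx, cancel := context.WithCancel(context.Background())
	return &service{ctx: ctx, cancel: cancel, blocksyncReady: make(chan struct{})}
}

// Spawn registers fn with the WaitGroup and runs it under the inner context.
func (s *service) Spawn(fn func(ctx context.Context)) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		fn(s.ctx)
	}()
}

// OnStart pre-spawns the conditional routine; it idles until the gate fires.
func (s *service) OnStart() {
	s.Spawn(func(ctx context.Context) {
		select {
		case <-s.blocksyncReady:
		case <-ctx.Done():
			return // stopped before block sync ever started
		}
		poolRoutine(ctx) // placeholder for the long-running blocksync work
	})
}

// SwitchToBlockSync only flips the gate instead of spawning a fresh goroutine.
func (s *service) SwitchToBlockSync() {
	s.readyOnce.Do(func() { close(s.blocksyncReady) })
}

// Stop cancels every spawned goroutine and waits for it to exit, so the caller
// can safely close the block store DB afterwards.
func (s *service) Stop() {
	s.cancel()
	s.wg.Wait()
}

func poolRoutine(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(10 * time.Millisecond):
			// Would touch the block store here; must not outlive Stop.
		}
	}
}

func main() {
	s := newService()
	s.OnStart()
	s.SwitchToBlockSync()
	time.Sleep(50 * time.Millisecond)
	s.Stop() // returns only after poolRoutine has exited
	fmt.Println("all routines stopped; safe to close the block store")
}
```

The property the sketch demonstrates is the one the PR relies on: Stop() cannot return while any spawned routine is still running, which is what lets the node close the block store afterwards without racing a reader.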
e4972b7 to a548430
poolRoutine is only ever called with state synced true.
…o masih/panic-leveldb-iter-tm
PR Summary
Medium Risk
Overview
Reviewed by Cursor Bugbot for commit 3ff41a0. Bugbot is set up for automated code reviews on this repo.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8c54806.
    case pool.requestsCh <- BlockRequest{height, peerID}:
        return true
    case <-ctx.Done():
        return false
Unprotected sendError can deadlock shutdown under new wg tracking
Low Severity
sendRequest was correctly updated to select on ctx.Done() to avoid blocking during shutdown, but sendError still performs an unconditional channel send. Previously this was benign because makeRequestersRoutine wasn't tracked by the WaitGroup. Now that it's routed through Spawn, if makeRequestersRoutine calls removeTimedoutPeers → sendError while errorsCh is full, it blocks. Since pool.wg.Wait() inside pool.Stop() now waits for makeRequestersRoutine, and pool.Stop() is called from Reactor.OnStop (which runs before reactor.wg.Wait), the reactor's requestRoutine that drains errorsCh may have already exited via its own ctx.Done(), creating a deadlock. The 1000-entry buffer makes this extremely unlikely in practice.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8c54806.
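One possible shape of a fix for the low-severity finding above, assuming sendError currently does a plain blocking send on errorsCh: mirror sendRequest and select on ctx.Done(), so a full errorsCh cannot wedge makeRequestersRoutine, and therefore pool.Stop(), once the reactor's errorsCh reader has already exited. The struct and field names below are illustrative, not the repo's exact definitions:

```go
package blocksync

import "context"

// peerError is assumed here to be the element type carried on errorsCh.
type peerError struct {
	err    error
	peerID string
}

// BlockPool is reduced to the single field this sketch needs.
type BlockPool struct {
	errorsCh chan peerError
}

// sendError mirrors sendRequest: selecting on ctx.Done() means a full errorsCh
// can no longer block the caller once shutdown has begun and the reactor's
// requestRoutine (which drains errorsCh) has already returned.
func (pool *BlockPool) sendError(ctx context.Context, err error, peerID string) bool {
	select {
	case pool.errorsCh <- peerError{err: err, peerID: peerID}:
		return true
	case <-ctx.Done():
		// Shutdown in progress: drop the error rather than block forever.
		return false
	}
}
```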
Successfully created backport PR for


Reactor.OnStart and BlockPool.OnStart started their long-running goroutines (requestRoutine, poolRoutine, processBlockSyncCh, processPeerUpdates, makeRequestersRoutine) with raw `go fn(ctx)` using the outer context. They were therefore not registered with the BaseService WaitGroup, and Stop() never waited for them. The outer ctx also outlived Stop, so the goroutines kept running after Stop returned.

During node shutdown this raced nodeImpl.OnStop's blockStore.Close(): poolRoutine, still inside SaveBlock -> Base() -> bs.db.Iterator, observed its leveldb table reader released and panicked with "leveldb/table: reader released".
Route each goroutine through BaseService.Spawn so it is tracked by the WaitGroup and bound to inner.ctx. Stop() now cancels them and blocks until they exit, which happens before the node closes the BlockStore DB. Add a regression test that asserts no blocksync goroutines remain after Reactor.Stop() returns.
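A self-contained sketch of the kind of check such a regression test can make, assuming every blocksync goroutine has a function name under an import path containing internal/blocksync: dump all goroutine stacks after Stop() and fail if any such frame remains. The package name and the harness that builds and starts the reactor are left to the existing test fixtures:

```go
package blocksync_test

import (
	"runtime"
	"strings"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// requireNoBlocksyncGoroutines dumps every goroutine stack and fails the test
// if any frame belongs to the blocksync package. Matching "internal/blocksync."
// (with the trailing dot) hits function names such as
// ".../internal/blocksync.(*Reactor).poolRoutine" but not this _test package or
// file paths like ".../internal/blocksync/reactor_test.go". It polls briefly so
// goroutines that have just passed wg.Done() get a moment to finish exiting.
func requireNoBlocksyncGoroutines(t *testing.T) {
	t.Helper()
	require.Eventually(t, func() bool {
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true = dump all goroutines
		return !strings.Contains(string(buf[:n]), "internal/blocksync.")
	}, time.Second, 10*time.Millisecond, "blocksync goroutines still running after Stop")
}
```

The PR's actual test (TestReactor_OnStopWaitsForGoroutines, mentioned in the review summary below) would start a reactor with the package's existing fixtures, call Reactor.Stop(), and then run a check of this shape.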
Note
Medium Risk
Changes blocksync/consensus goroutine lifecycles and shutdown ordering; mistakes could cause hangs or missed transitions, but the change is localized and covered by a new regression test.
Overview
- Fixes blocksync shutdown races by moving long-running goroutines off raw `go` launches and onto `BaseService.Spawn`/`SpawnCritical`, ensuring `Stop()` cancels the correct context and waits for all blocksync routines to exit before the block store is closed.
- Adds readiness gates (`blocksyncReady`, `consensusReady`) so routines can be pre-spawned in `Reactor.OnStart` yet only begin work when block sync starts or the consensus handoff completes, and updates `BlockPool`/`bpRequester` shutdown to avoid blocking on a full `requestsCh`.
- Updates the consensus handoff API (`SwitchToConsensus` signature) and adds a regression test (`TestReactor_OnStopWaitsForGoroutines`) that asserts no `internal/blocksync` goroutines remain after `Reactor.Stop()` returns.

Reviewed by Cursor Bugbot for commit 5479315. Bugbot is set up for automated code reviews on this repo.
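The readiness gates named in the overview imply a one-shot signal primitive. Purely as an illustration of what such a primitive can look like (a guess at the shape of something like utils.AtomicSend[bool], not the repo's actual implementation), here is a minimal version for a single waiter:

```go
package utils

import "sync"

// AtomicSend is a sketch of a one-shot signal: Send delivers a value at most
// once, and the channel returned by Ready yields it to a single waiter, which
// typically selects on it alongside ctx.Done(). Later Send calls are no-ops.
type AtomicSend[T any] struct {
	once sync.Once
	ch   chan T
}

func NewAtomicSend[T any]() *AtomicSend[T] {
	// Buffer of one so Send never blocks, even if nobody is waiting yet.
	return &AtomicSend[T]{ch: make(chan T, 1)}
}

// Send delivers v exactly once.
func (s *AtomicSend[T]) Send(v T) {
	s.once.Do(func() { s.ch <- v })
}

// Ready returns the channel to wait on; intended for a single receiver.
func (s *AtomicSend[T]) Ready() <-chan T {
	return s.ch
}
```

A pre-spawned routine would then wait with `select { case <-gate.Ready(): ... case <-ctx.Done(): return }`, which is what lets Reactor.OnStart spawn everything up front while SwitchToBlockSync and the consensus handoff merely fire the corresponding gate.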