Wait for blocksync goroutines on Stop to fix leveldb shutdown panic #3415
Conversation
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
Codecov Report ❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3415      +/-   ##
==========================================
+ Coverage   59.28%   59.32%   +0.04%
==========================================
  Files        2120     2120
  Lines      175516   175556      +40
==========================================
+ Hits       104053   104156     +103
+ Misses      62393    62324      -69
- Partials     9070     9076       +6
Flags with carried forward coverage won't be shown.
Marked back as draft to take a closer look at reactor code before opening back up for review.
Reactor.OnStart, Reactor.SwitchToBlockSync, BlockPool.OnStart, and the auto-restart spawn inside poolRoutine all started their long-running goroutines with raw `go fn(ctx)` using the outer context. They were therefore not registered with the BaseService WaitGroup, and Stop() never waited for them. The outer ctx also outlived Stop, so the goroutines kept running after Stop returned.

During node shutdown this raced nodeImpl.OnStop's blockStore.Close(): poolRoutine, still inside SaveBlock -> Base() -> bs.db.Iterator, observed its leveldb table reader released and panicked with "leveldb/table: reader released". autoRestartIfBehind, which also reads from the BlockStore, has the same race.

Mirror consensus.Reactor's SwitchToConsensus pattern: pre-spawn all long-running routines in OnStart through BaseService.Spawn, gate the conditional ones (requestRoutine, poolRoutine, autoRestartIfBehind) on utils.AtomicSend[bool] signals, and have SwitchToBlockSync and the blocksync->consensus handoff trigger those signals instead of spawning fresh goroutines. Stop() now cancels every blocksync goroutine via inner.ctx and blocks on inner.wg until they exit, which happens before the node closes the BlockStore DB. SwitchToConsensus still receives the node-scoped ctx (captured at OnStart) so the consensus reactor's handoff is not affected by blocksync's own cancellation.

Add a regression test that asserts no blocksync goroutines remain after Reactor.Stop() returns.
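For illustration, a minimal, self-contained sketch of the pre-spawn-and-gate pattern described above. The `Spawn` helper, the `blocksyncReady` channel, and the `poolRoutine` stub are stand-ins for this repo's BaseService.Spawn, the utils.AtomicSend[bool] signals, and the real blocksync routines; none of it is the PR's actual code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// service sketches the relevant parts of a BaseService-like wrapper: Spawn
// tracks goroutines in a WaitGroup and ties them to an internal context that
// Stop cancels, so Stop can block until every routine has exited.
type service struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup

	blocksyncReady chan struct{} // one-shot gate, standing in for utils.AtomicSend[bool]
	readyOnce      sync.Once
}

func newService() *service {
	ctx, cancel := context.WithCancel(context.Background())
	return &service{ctx: ctx, cancel: cancel, blocksyncReady: make(chan struct{})}
}

// Spawn registers fn with the WaitGroup and runs it under the inner context.
func (s *service) Spawn(fn func(ctx context.Context)) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		fn(s.ctx)
	}()
}

// OnStart pre-spawns the conditional routine; it idles until the gate fires.
func (s *service) OnStart() {
	s.Spawn(func(ctx context.Context) {
		select {
		case <-s.blocksyncReady:
		case <-ctx.Done():
			return // stopped before block sync ever started
		}
		poolRoutine(ctx) // placeholder for the long-running blocksync work
	})
}

// SwitchToBlockSync only flips the gate instead of spawning a fresh goroutine.
func (s *service) SwitchToBlockSync() {
	s.readyOnce.Do(func() { close(s.blocksyncReady) })
}

// Stop cancels every spawned goroutine and waits for it to exit, so the caller
// can safely close the block store DB afterwards.
func (s *service) Stop() {
	s.cancel()
	s.wg.Wait()
}

func poolRoutine(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(10 * time.Millisecond):
			// Would touch the block store here; must not outlive Stop.
		}
	}
}

func main() {
	s := newService()
	s.OnStart()
	s.SwitchToBlockSync()
	time.Sleep(50 * time.Millisecond)
	s.Stop() // returns only after poolRoutine has exited
	fmt.Println("all routines stopped; safe to close the block store")
}
```

The property the sketch demonstrates is the one the PR relies on: Stop() cannot return while any spawned routine is still running, which is what lets the node close the block store afterwards without racing a reader.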
e4972b7 to a548430
poolRoutine is only ever called with state synced true.
…o masih/panic-leveldb-iter-tm
PR Summary
Medium Risk
Overview
Reviewed by Cursor Bugbot for commit 3ff41a0. Bugbot is set up for automated code reviews on this repo.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8c54806.
    case pool.requestsCh <- BlockRequest{height, peerID}:
        return true
    case <-ctx.Done():
        return false
Unprotected sendError can deadlock shutdown under new wg tracking
Low Severity
sendRequest was correctly updated to select on ctx.Done() to avoid blocking during shutdown, but sendError still performs an unconditional channel send. Previously this was benign because makeRequestersRoutine wasn't tracked by the WaitGroup. Now that it's routed through Spawn, if makeRequestersRoutine calls removeTimedoutPeers → sendError while errorsCh is full, it blocks. Since pool.wg.Wait() inside pool.Stop() now waits for makeRequestersRoutine, and pool.Stop() is called from Reactor.OnStop (which runs before reactor.wg.Wait), the reactor's requestRoutine that drains errorsCh may have already exited via its own ctx.Done(), creating a deadlock. The 1000-entry buffer makes this extremely unlikely in practice.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8c54806.
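One possible shape of a fix for the low-severity finding above, assuming sendError currently does a plain blocking send on errorsCh: mirror sendRequest and select on ctx.Done(), so a full errorsCh cannot wedge makeRequestersRoutine, and therefore pool.Stop(), once the reactor's errorsCh reader has already exited. The struct and field names below are illustrative, not the repo's exact definitions:

```go
package blocksync

import "context"

// peerError is assumed here to be the element type carried on errorsCh.
type peerError struct {
	err    error
	peerID string
}

// BlockPool is reduced to the single field this sketch needs.
type BlockPool struct {
	errorsCh chan peerError
}

// sendError mirrors sendRequest: selecting on ctx.Done() means a full errorsCh
// can no longer block the caller once shutdown has begun and the reactor's
// requestRoutine (which drains errorsCh) has already returned.
func (pool *BlockPool) sendError(ctx context.Context, err error, peerID string) bool {
	select {
	case pool.errorsCh <- peerError{err: err, peerID: peerID}:
		return true
	case <-ctx.Done():
		// Shutdown in progress: drop the error rather than block forever.
		return false
	}
}
```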
Successfully created backport PR for


Reactor.OnStart and BlockPool.OnStart started their long-running goroutines (requestRoutine, poolRoutine, processBlockSyncCh, processPeerUpdates, makeRequestersRoutine) with raw `go fn(ctx)` using the outer context. They were therefore not registered with the BaseService WaitGroup, and Stop() never waited for them. The outer ctx also outlived Stop, so the goroutines kept running after Stop returned.

During node shutdown this raced nodeImpl.OnStop's blockStore.Close(): poolRoutine, still inside SaveBlock -> Base() -> bs.db.Iterator, observed its leveldb table reader released and panicked with "leveldb/table: reader released".
Route each goroutine through BaseService.Spawn so it is tracked by the WaitGroup and bound to inner.ctx. Stop() now cancels them and blocks until they exit, which happens before the node closes the BlockStore DB. Add a regression test that asserts no blocksync goroutines remain after Reactor.Stop() returns.
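A self-contained sketch of the kind of check such a regression test can make, assuming every blocksync goroutine has a function name under an import path containing internal/blocksync: dump all goroutine stacks after Stop() and fail if any such frame remains. The package name and the harness that builds and starts the reactor are left to the existing test fixtures:

```go
package blocksync_test

import (
	"runtime"
	"strings"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// requireNoBlocksyncGoroutines dumps every goroutine stack and fails the test
// if any frame belongs to the blocksync package. Matching "internal/blocksync."
// (with the trailing dot) hits function names such as
// ".../internal/blocksync.(*Reactor).poolRoutine" but not this _test package or
// file paths like ".../internal/blocksync/reactor_test.go". It polls briefly so
// goroutines that have just passed wg.Done() get a moment to finish exiting.
func requireNoBlocksyncGoroutines(t *testing.T) {
	t.Helper()
	require.Eventually(t, func() bool {
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true = dump all goroutines
		return !strings.Contains(string(buf[:n]), "internal/blocksync.")
	}, time.Second, 10*time.Millisecond, "blocksync goroutines still running after Stop")
}
```

The PR's actual test (TestReactor_OnStopWaitsForGoroutines, mentioned in the review summary below) would start a reactor with the package's existing fixtures, call Reactor.Stop(), and then run a check of this shape.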
Note
Medium Risk
Changes blocksync/consensus goroutine lifecycles and shutdown ordering; mistakes could cause hangs or missed transitions, but the change is localized and covered by a new regression test.
Overview
- Fixes blocksync shutdown races by moving long-running goroutines off raw `go` launches and onto `BaseService.Spawn`/`SpawnCritical`, ensuring `Stop()` cancels the correct context and waits for all blocksync routines to exit before the block store is closed.
- Adds readiness gates (`blocksyncReady`, `consensusReady`) so routines can be pre-spawned in `Reactor.OnStart` yet only begin work when block sync starts or the consensus handoff completes, and updates `BlockPool`/`bpRequester` shutdown to avoid blocking on a full `requestsCh`.
- Updates the consensus handoff API (`SwitchToConsensus` signature) and adds a regression test (`TestReactor_OnStopWaitsForGoroutines`) that asserts no `internal/blocksync` goroutines remain after `Reactor.Stop()` returns.

Reviewed by Cursor Bugbot for commit 5479315. Bugbot is set up for automated code reviews on this repo.
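The readiness gates named in the overview imply a one-shot signal primitive. Purely as an illustration of what such a primitive can look like (a guess at the shape of something like utils.AtomicSend[bool], not the repo's actual implementation), here is a minimal version for a single waiter:

```go
package utils

import "sync"

// AtomicSend is a sketch of a one-shot signal: Send delivers a value at most
// once, and the channel returned by Ready yields it to a single waiter, which
// typically selects on it alongside ctx.Done(). Later Send calls are no-ops.
type AtomicSend[T any] struct {
	once sync.Once
	ch   chan T
}

func NewAtomicSend[T any]() *AtomicSend[T] {
	// Buffer of one so Send never blocks, even if nobody is waiting yet.
	return &AtomicSend[T]{ch: make(chan T, 1)}
}

// Send delivers v exactly once.
func (s *AtomicSend[T]) Send(v T) {
	s.once.Do(func() { s.ch <- v })
}

// Ready returns the channel to wait on; intended for a single receiver.
func (s *AtomicSend[T]) Ready() <-chan T {
	return s.ch
}
```

A pre-spawned routine would then wait with `select { case <-gate.Ready(): ... case <-ctx.Done(): return }`, which is what lets Reactor.OnStart spawn everything up front while SwitchToBlockSync and the consensus handoff merely fire the corresponding gate.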