fix(node): async chain.db save off BFT critical path (Fix A) #565
Merged
Conversation
Per-peer libp2p request-response is single-attempt delivery: when a
prevote or precommit was dropped on a hop in the WAN mesh, the only
retry was our manual 0.5 s × 6 tick in the validator loop. With
validators ~10-50 ms apart across hosts, each phase paid that latency ×
peer-count, so mainnet block time sat at ~3.6 s/blk while testnet
(localhost) ran at 0.46 s.
Switches the four BFT message classes to gossipsub mesh topics:
sentrix/bft/proposal/1
sentrix/bft/prevote/1
sentrix/bft/precommit/1
sentrix/bft/round-status/1
Mesh fan-out + IHAVE/IWANT lazy-push handle retransmission natively;
the validator-side vote rebroadcast tick is gone (proposal rebroadcast
stays — it replays the saved signed proposal verbatim, separate from
the missed-vote retry pattern). The SwarmCommand::Broadcast variant is
now dead and has been removed.
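For illustration, a minimal sketch of what the topic wiring could look like with rust-libp2p's gossipsub behaviour. The topic strings are the ones listed above; the function names and error handling are placeholders, not the node's actual code.

```rust
use libp2p::gossipsub;

// The four mesh topics this PR moves BFT traffic onto.
const BFT_TOPICS: [&str; 4] = [
    "sentrix/bft/proposal/1",
    "sentrix/bft/prevote/1",
    "sentrix/bft/precommit/1",
    "sentrix/bft/round-status/1",
];

// Subscribe once at swarm construction; gossipsub's mesh fan-out and
// IHAVE/IWANT lazy-push then handle retransmission for every topic.
fn subscribe_bft_topics(
    gs: &mut gossipsub::Behaviour,
) -> Result<(), gossipsub::SubscriptionError> {
    for name in BFT_TOPICS {
        gs.subscribe(&gossipsub::IdentTopic::new(name))?;
    }
    Ok(())
}

// Publishing a signed prevote replaces the old per-peer request-response
// send (and the 0.5 s × 6 manual retry tick that backed it).
fn publish_prevote(
    gs: &mut gossipsub::Behaviour,
    signed_prevote: Vec<u8>,
) -> Result<(), gossipsub::PublishError> {
    gs.publish(
        gossipsub::IdentTopic::new("sentrix/bft/prevote/1"),
        signed_prevote,
    )?;
    Ok(())
}
```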
Wire types: new GossipBft{Proposal,Prevote,Precommit,RoundStatus}
envelopes alongside the existing GossipBlock / GossipTransaction.
SENTRIX_PROTOCOL bumped 2.0.0 → 2.1.0 so old peers (RR-only BFT) can't
silently interop with new peers (gossipsub-only BFT).
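The PR only names the new envelope types, so the sketch below is one plausible shape: the constant's type and the payload fields are assumptions, not the actual wire format.

```rust
use serde::{Deserialize, Serialize};

// Bumped so 2.0.0 (RR-only BFT) peers fail protocol negotiation against
// 2.1.0 (gossipsub-only BFT) peers instead of silently half-interoperating.
pub const SENTRIX_PROTOCOL: &str = "2.1.0";

// One envelope per topic; the signed BFT payload stays opaque bytes here.
#[derive(Serialize, Deserialize)]
pub struct GossipBftProposal {
    pub height: u64,
    pub round: u32,
    pub signed_proposal: Vec<u8>,
}

#[derive(Serialize, Deserialize)]
pub struct GossipBftPrevote {
    pub height: u64,
    pub round: u32,
    pub signed_vote: Vec<u8>,
}

#[derive(Serialize, Deserialize)]
pub struct GossipBftPrecommit {
    pub height: u64,
    pub round: u32,
    pub signed_vote: Vec<u8>,
}

#[derive(Serialize, Deserialize)]
pub struct GossipBftRoundStatus {
    pub height: u64,
    pub round: u32,
    pub step: u8,
}
```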
Deploy: halt-all + simul-start. Mid-deploy a validator on the old
binary would gossip nothing the new binary subscribes to, and vice
versa. Same procedure as the 2026-05-10 evening swap to v2.1.91 +
watchdog removal.
Inbound boundary checks (verify_sig + is_active_bft_signer) mirror
the existing RR path so Byzantine peers can't push forged votes into
the engine via either transport. RR BFT request handlers stay in
place as a defensive no-op for any peer still attempting the old
path; the protocol negotiation will reject them anyway.
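A compilable sketch of that shared boundary check, under the assumption that verify_sig and is_active_bft_signer look roughly like this; Vote and Engine are placeholder types, not the node's real ones.

```rust
use std::collections::HashSet;

// Placeholder stand-ins so the sketch is self-contained.
struct Vote {
    signer: [u8; 32],
    signature: Vec<u8>,
}

struct Engine {
    active_signers: HashSet<[u8; 32]>,
}

impl Vote {
    fn verify_sig(&self) -> bool {
        // Real node: verify the signature over the vote payload.
        !self.signature.is_empty()
    }
}

impl Engine {
    fn is_active_bft_signer(&self, signer: &[u8; 32]) -> bool {
        self.active_signers.contains(signer)
    }
    fn handle_vote(&mut self, _vote: Vote) {
        // Feed the BFT state machine.
    }
}

// Both transports (gossipsub topics and the legacy RR handlers) funnel
// inbound votes through the same check, so a Byzantine peer cannot push a
// forged vote into the engine via either path.
fn accept_inbound_vote(engine: &mut Engine, vote: Vote) -> bool {
    if !vote.verify_sig() {
        return false;
    }
    if !engine.is_active_bft_signer(&vote.signer) {
        return false;
    }
    engine.handle_vote(vote);
    true
}
```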
save_blockchain previously fsync-blocked the validator loop inside the
BftAction::FinalizeBlock arm — on mainnet that was 500 ms-1 s per block
on a 5 GB chain.db, sitting in front of the next round's propose call.
Move it to a single tokio writer task drained from an mpsc channel.
- Writer takes a brief read lock to serialise the state blob, releases
  it, commits MDBX. Multiple queued heights coalesce into one snapshot
  since save_blockchain always writes the latest state.
- B2 load-replay (PR #556) already covers the crash window between the
  in-memory commit and the queued disk save.
- Shutdown still does one final sync save_blockchain after the validator
  task exits — safety net.
- save_block (block bytes only) on the peer-propose path stays sync;
  it's small and we don't broadcast unless the block bytes hit disk.
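A minimal sketch of that writer task, assuming placeholder Chain / SaveRequest types and helper names; only the lock-then-serialise-then-commit ordering and the coalescing behaviour are taken from the description above.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, RwLock};

struct Chain; // stand-in for the in-memory chain state

impl Chain {
    fn serialize_state(&self) -> Vec<u8> {
        Vec::new() // placeholder for the real state blob
    }
}

struct SaveRequest {
    height: u64,
}

async fn commit_to_mdbx(_height: u64, _blob: Vec<u8>) {
    // Real node: MDBX put + commit; the fsync cost lands here, off the
    // validator loop.
}

async fn writer_task(chain: Arc<RwLock<Chain>>, mut rx: mpsc::Receiver<SaveRequest>) {
    while let Some(first) = rx.recv().await {
        // Coalesce: save_blockchain always writes the latest state, so every
        // height queued behind `first` collapses into this one snapshot.
        let mut latest = first.height;
        while let Ok(next) = rx.try_recv() {
            latest = latest.max(next.height);
        }
        // Brief read lock only while serialising the state blob ...
        let blob = { chain.read().await.serialize_state() };
        // ... then commit with the lock already released.
        commit_to_mdbx(latest, blob).await;
    }
}
```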
Codecov Report: ✅ All modified and coverable lines are covered by tests.
github-actions bot pushed a commit that referenced this pull request on May 11, 2026:
Captures the post-#564 stack as a distinct version label:
- #564 BFT votes over gossipsub (wire SENTRIX_PROTOCOL 2.0.0 -> 2.1.0)
- #565 Fix A: async chain.db save off BFT critical path
- #566 Fix C: speculative pre-build of next proposal
- #567 self-describe chain_name from genesis + load-fixup
- #568 remove inbound-silence watchdog
Multiple distinct binaries shipped under 2.1.91 today during the mainnet
stall recovery cycle. Bumping so the next build maps 1:1 to a single
sha + version label. Going forward: every chain-touching PR bumps in the
same commit set (see operator memory feedback_bump_version_per_fix.md).
Summary
`save_blockchain` previously fsync-blocked the validator loop inside
the `BftAction::FinalizeBlock` arm. On mainnet that was ~500 ms-1 s
per block on a 5 GB chain.db, sitting in front of the next round's
propose call.
Move it to a single tokio writer task drained from a bounded mpsc
channel. The writer briefly takes the read lock to serialise the state
blob, releases it, then commits to MDBX. Multiple queued heights coalesce
into one snapshot, since save_blockchain always writes the latest state.
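For completeness, a sketch of the enqueue side; the `BftAction::FinalizeBlock` name comes from the PR, everything else is a placeholder.

```rust
use tokio::sync::mpsc;

enum BftAction {
    FinalizeBlock { height: u64 },
    // ... other actions elided
}

// The FinalizeBlock arm now queues the disk save instead of calling
// save_blockchain inline, so the next round's propose starts immediately
// after the in-memory commit.
fn on_bft_action(action: BftAction, save_tx: &mpsc::Sender<u64>) {
    match action {
        BftAction::FinalizeBlock { height } => {
            // try_send keeps the validator loop non-blocking even if the
            // bounded channel is full; a dropped request is harmless because
            // the pending save will persist the latest state anyway.
            let _ = save_tx.try_send(height);
        }
    }
}
```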
Why
Per-block timing on mainnet (BFT phases ~200 ms, save_blockchain
~500 ms to 1 s, propose-build ~100 ms) showed the synchronous disk save
as the single biggest contributor to the ~2.7 s/blk mainnet WAN block
time under the PR #564 transport. Pipelining the save off the critical
path lets the next round start immediately after the in-memory commit.
Crash safety
The B2 load-replay path (PR #556) already covers a crash between the
in-memory commit and a queued save: on restart, `load_blockchain`
detects `disk_height > blob_height` and replays the missing blocks
via `add_block_from_peer`. Fix A relies on that mechanism to cover any
blocks committed in memory but not yet persisted when the process dies.
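A sketch of the replay this relies on, with placeholder types: `load_blockchain`, `add_block_from_peer`, and the `disk_height > blob_height` check are from PR #556 as described above, the rest is illustrative.

```rust
struct Blockchain {
    blob_height: u64, // height covered by the last saved state blob
}

impl Blockchain {
    fn add_block_from_peer(&mut self, _block_bytes: Vec<u8>) {
        // Full verification + state transition, same path as a live peer block.
        self.blob_height += 1;
    }
}

// On restart, block bytes on disk above the saved blob mark the crash window
// between the in-memory commit and a queued (but never written) save.
fn load_blockchain(
    mut chain: Blockchain,
    disk_height: u64,
    read_block_bytes: impl Fn(u64) -> Vec<u8>,
) -> Blockchain {
    while chain.blob_height < disk_height {
        let next = read_block_bytes(chain.blob_height + 1);
        chain.add_block_from_peer(next);
    }
    chain
}
```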
The shutdown signal handler still does one final synchronous
`save_blockchain` after the validator task exits — explicit safety net.
`save_block` (block bytes only, on the peer-propose path) stays
synchronous; the small write is kept on the critical path so we don't
broadcast a block whose bytes never reached disk.
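The ordering that invariant protects, as a trivial sketch with placeholder closures:

```rust
use std::io;

// Keep the small save_block write on the critical path: only gossip a block
// whose bytes are already durable on disk.
fn persist_then_broadcast(
    block_bytes: Vec<u8>,
    save_block: impl Fn(&[u8]) -> io::Result<()>,
    broadcast: impl Fn(Vec<u8>),
) -> io::Result<()> {
    save_block(&block_bytes)?; // synchronous, small write
    broadcast(block_bytes); // announce only after the bytes hit disk
    Ok(())
}
```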
Deploy results — 2026-05-10 ~00:00 WIB
Binary sha `9345dd807b10254a1dd678ce707f0e2f69d8a79ab3cc558ef4f9e8957a093edc`
built in `rust:1.95-bullseye`, deployed across all 11 nodes via
halt-all + simul-start.
60-second rolling window after stabilisation:
Test plan