feat: accept stale channel monitors for recovery#76
Open
ben-kaufman wants to merge 18 commits intomainfrom
Open
feat: accept stale channel monitors for recovery#76ben-kaufman wants to merge 18 commits intomainfrom
ben-kaufman wants to merge 18 commits intomainfrom
Conversation
jvsena42
reviewed
Mar 18, 2026
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ben-kaufman
added a commit
to synonymdev/bitkit-ios
that referenced
this pull request
Mar 18, 2026
On BuildError.ReadFailed (likely stale ChannelMonitor from migration overwrite), automatically retry once with accept_stale_channel_monitors enabled. The ldk-node recovery flag force-syncs the monitor's update_id and heals commitment state via a delayed chain sync + keysend round-trip. A persisted UserDefaults flag ensures this only triggers once — set on any successful build (affected or not), preventing future retries. Depends on: synonymdev/ldk-node#76 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ben-kaufman
added a commit
to synonymdev/bitkit-android
that referenced
this pull request
Mar 18, 2026
On BuildException.ReadFailed (likely stale ChannelMonitor from migration overwrite), automatically retry once with accept_stale_channel_monitors enabled. The ldk-node recovery flag force-syncs the monitor's update_id and heals commitment state via a delayed chain sync + keysend round-trip. A persisted SharedPreferences flag ensures this only triggers once — set on any successful build (affected or not), preventing future retries. Depends on: synonymdev/ldk-node#76 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
When a channel monitor's update_id falls behind the ChannelManager (e.g. after a migration overwrote newer data with stale backup data), LDK refuses to start with DangerousValue. This adds a recovery path: - Builder: new `set_accept_stale_channel_monitors(bool)` flag - Build: passes flag to ChannelManagerReadArgs, which force-syncs stale monitor update_ids instead of returning DangerousValue - Startup: when flag is set, defers chain sync while sending probes on all channels to trigger commitment round-trips that heal the stale monitor state. Polls monitor update_ids with 60s timeout and retries probes every 10s for late-connecting peers. The probe-triggered commitment round-trip provides: - LatestHolderCommitmentTXInfo (correct current commitment state) - CommitmentSecret (recovers all gap revocation secrets via the derivation tree) Depends on: ben-kaufman/rust-lightning#fix/accept-stale-channel-monitors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move inline imports (Path, RouteHop, NodeFeatures, ChannelFeatures) to top of lib.rs - Release is_running write lock before healing block_on to prevent stop() deadlock during recovery - Add stop_sender signal to healing loop so shutdown can cancel it - Early return after defer_chain_sync path (is_running already set) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract duplicated Path/RouteHop construction into build_probe_path closure - Address review feedback points with inline comments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 1 (CRITICAL): send_probe rejects single-hop paths ("No need probing
a path with less than two hops"). Replaced with send_spontaneous_payment
which sends a 1-sat keysend — works for direct single-hop channels.
Bug 2: start() spawned chain sync after stop() completed during healing.
Added is_running check after block_on returns.
Bug 3: accept_stale_channel_monitors flag was never cleared, causing
every restart on the same Node instance to re-trigger the 60s healing
delay. Changed to AtomicBool, cleared after healing completes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set max_total_routing_fee_msat = 0 to prevent the router from picking a multi-hop route through a different channel, which would heal the wrong monitor. With zero max fee, only the direct (zero-fee) route to the counterparty is used. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When multiple channels exist with the same peer, the router may pick the same channel for both payments — leaving the other unhealed. Fix: send one payment per unique counterparty (not per channel) and dedup in the retry loop. Each retry may route through a different channel as outbound capacity shifts from previous attempts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move HashSet (lib.rs) and AtomicBool (builder.rs) to top-level imports per CLAUDE.md: "ALWAYS move imports to the top of the file" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Remove peer dedup from healing payments — send one per unhealed channel (not per peer). The previous dedup prevented retries to the same peer even when some of their channels remained unhealed. 2. Fix TOCTOU race: subscribe to stop_sender while holding the is_running read lock before spawning chain sync. This prevents stop() from completing between the check and subscribe, which would orphan the task (missing the already-sent stop signal). Extracted spawn_chain_sync_task_with_receiver() so the normal path (spawn_chain_sync_task) still works unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cd33be3 to
cb60b2e
Compare
jvsena42
pushed a commit
to synonymdev/bitkit-ios
that referenced
this pull request
Mar 18, 2026
On BuildError.ReadFailed (likely stale ChannelMonitor from migration overwrite), automatically retry once with accept_stale_channel_monitors enabled. The ldk-node recovery flag force-syncs the monitor's update_id and heals commitment state via a delayed chain sync + keysend round-trip. A persisted UserDefaults flag ensures this only triggers once — set on any successful build (affected or not), preventing future retries. Depends on: synonymdev/ldk-node#76 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BuildError::ReadFailed is returned by 19 different code paths (KVStore I/O errors, deserialization failures, task join errors). Using it to detect stale channel monitors in app-side retry logic causes false positives: a corrupt scorer or transient VSS timeout would burn the one-shot recovery flag, preventing actual stale monitor recovery later. Add a dedicated DangerousValue variant that is returned only when ChannelManager::read fails with DecodeError::DangerousValue (stale monitors vs manager desync). This lets apps catch the specific case without interfering with other ReadFailed scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator
New commit:
|
…Value) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch rust-lightning fork to ovitrif/rust-lightning#0.2.2-accept-stale-monitors-v2 which resets the commitment secrets store in force_set_latest_update_id, preventing SIGABRT from a ChannelMonitorUpdateStatus mode mismatch after provide_secret() fails validation on a stale secrets tree. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
46c6e32 to
153ecbe
Compare
ben-kaufman
added a commit
to synonymdev/bitkit-ios
that referenced
this pull request
Mar 18, 2026
On BuildError.ReadFailed (likely stale ChannelMonitor from migration overwrite), automatically retry once with accept_stale_channel_monitors enabled. The ldk-node recovery flag force-syncs the monitor's update_id and heals commitment state via a delayed chain sync + keysend round-trip. A persisted UserDefaults flag ensures this only triggers once — set on any successful build (affected or not), preventing future retries. Depends on: synonymdev/ldk-node#76 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ben-kaufman
added a commit
to synonymdev/bitkit-ios
that referenced
this pull request
Mar 18, 2026
On BuildError.ReadFailed (likely stale ChannelMonitor from migration overwrite), automatically retry once with accept_stale_channel_monitors enabled. The ldk-node recovery flag force-syncs the monitor's update_id and heals commitment state via a delayed chain sync + keysend round-trip. A persisted UserDefaults flag ensures this only triggers once — set on any successful build (affected or not), preventing future retries. Depends on: synonymdev/ldk-node#76 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coreyphillips
requested changes
Mar 19, 2026
coreyphillips
approved these changes
Mar 19, 2026
…rce-close The healing keysend creates an HTLC with cltv_expiry based on the ChannelManager's best block height. For users offline >24h (all affected users), the height is stale — the HTLC is already expired by the time chain sync catches up, causing LDK to force-close the channel (HTLCsTimedOut). This defeats the entire recovery. Fix: call sync_lightning_wallet() before sending healing payments. This updates the ChannelManager's chain tip to the current height so the HTLC gets a valid CLTV expiry. If sync fails, skip the keysend entirely to avoid the stale-CLTV force-close — the monitor will heal naturally on the first real user payment after continuous chain sync starts. The stale monitor processes blocks during this sync (accepted risk: Blocktank is trusted, and the monitor's force-synced update_id makes it "valid" from LDK's perspective — the only concern is counterparty force-close during the offline period, which Blocktank wouldn't do). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… (rc.36) When orphaned channel migration encounters an existing monitor that can't be deserialized (e.g. UnknownVersion), skip the migration instead of killing node startup with ReadFailed. The existing data is preserved. Also adds HTLC timeout force-close fix to CHANGELOG (from rc.35). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jvsena42
approved these changes
Mar 20, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
set_accept_stale_channel_monitors(bool)builder API for one-time recovery when a channel monitor'supdate_idfalls behind the ChannelManager (e.g. after migration overwrote newer monitor data with stale backup)DangerousValueThe probe-triggered commitment round-trip provides the monitor with:
LatestHolderCommitmentTXInfo— correct current commitment with real signaturesCommitmentSecret— recovers all gap revocation secrets via the derivation treeContext
Related to bitkit-android#847 and bitkit-ios#495.
Root cause: recent PRs in bitkit ios & android introduced orphaned channel recovery that re-fetched old RN monitors and passed them via
setChannelDataMigration(). The app was pinned to ldk-node pre-rc.33 which blindly overwrote existing VSS monitors without comparing update_ids. These PRs got released in a recent app update.Depends on: ovitrif/rust-lightning@0.2.2-accept-stale-monitors-v2 (patched
force_set_latest_update_id()that resetsCounterpartyCommitmentSecrets+accept_stale_channel_monitorsflag onChannelManagerReadArgs).Test plan
Related
🤖 Built with help from Claude Code