fix: resharing retrying & timeout#568
Conversation
Pull Request Overview
This PR refactors the resharing protocol to address timeout and retry issues. The key improvement is splitting resharing into two phases: an "awaiting ready" phase where nodes broadcast readiness, and a "running" phase where the actual resharing protocol executes. This ensures all participants start roughly simultaneously and allows nodes to restart and rejoin without abandoning the entire resharing process.
Key Changes:
- Introduces a two-phase resharing process with timeout management
- Adds `ResharingReadyMessage` protocol for coordinating participant readiness
- Implements configurable timeout (default 5 minutes) for the running phase
- Restructures state management to support phase transitions and retries
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| integration-tests/tests/cases/mod.rs | Adds test verifying resharing completes when offline participant restarts |
| integration-tests/src/mpc_fixture/fixture_tasks.rs | Updates message logging to include ResharingReady messages |
| integration-tests/src/mpc_fixture/builder.rs | Removes unused config_rx parameter |
| chain-signatures/node/src/types.rs | Simplifies ReshareProtocol by removing redundant fields and refresh method |
| chain-signatures/node/src/protocol/test_setup.rs | Subscribes to resharing_ready channel in test setup |
| chain-signatures/node/src/protocol/state.rs | Adds new state types for two-phase resharing with timeout tracking |
| chain-signatures/node/src/protocol/mod.rs | Removes unused config parameter from progress method |
| chain-signatures/node/src/protocol/message/types.rs | Defines ResharingReadyMessage type and protocol variant |
| chain-signatures/node/src/protocol/message/sub.rs | Adds subscription support for ResharingReady messages |
| chain-signatures/node/src/protocol/message/mod.rs | Implements inbox handling and subscription for resharing readiness |
| chain-signatures/node/src/protocol/cryptography.rs | Implements two-phase resharing logic with timeout and retry mechanisms |
| chain-signatures/node/src/protocol/consensus.rs | Updates consensus handling to use new resharing state structure |
| chain-signatures/node/src/cli.rs | Subscribes to resharing_ready channel and removes config_rx parameter |
```rust
pub fn set_resharing_running_timeout(duration: Duration) {
    RESHARING_RUNNING_TIMEOUT_SECS.swap(duration.as_secs(), Ordering::SeqCst);
}
```
Using swap instead of store suggests the old value might be used, but it's discarded here. Use store instead for clarity since the return value is unused.
```diff
- RESHARING_RUNNING_TIMEOUT_SECS.swap(duration.as_secs(), Ordering::SeqCst);
+ RESHARING_RUNNING_TIMEOUT_SECS.store(duration.as_secs(), Ordering::SeqCst);
```
Super nitpicky but Copilot has a point about the semantics fitting better to store than swap. 😉
While at it, another detail is that with swap + Ordering::SeqCst, the compiler has to generate an atomic load with appropriate memory fences to ensure all potentially outstanding stores (or swaps) of other threads to RESHARING_RUNNING_TIMEOUT_SECS have been observed by this thread before this thread can perform its own swap. This can lead to quite expensive thread stalling and CPU cache invalidations.
Using store instead of swap would get rid of the load. I think that would already avoid CPU-level synchronization for x86 targets thanks to their relatively stronger memory model.
If you further change `Ordering::SeqCst` to `Ordering::Relaxed`, we should be able to entirely avoid expensive CPU synchronization instructions on all platforms. This is safe since we don't actually use this variable for multi-threaded synchronization. We just need every get and set to be atomic. Relaxed load / store is the best choice for this use case.
But again, this isn't important in the grand scheme of things. Feel free to change it if you want or keep it as is. It won't make a noticeable difference either way.
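To make the suggestion concrete, here is a minimal sketch of the relaxed-ordering variant. The static, its initial value of 300 seconds, and the getter are assumptions for illustration; only the `store`/`Relaxed` pattern is the point.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Hypothetical stand-in for the node's timeout global (assumed default: 300s).
static RESHARING_RUNNING_TIMEOUT_SECS: AtomicU64 = AtomicU64::new(300);

// `store` with `Relaxed` suffices: the value is a standalone config knob and
// is not used to synchronize any other memory between threads. We only need
// each individual load and store to be atomic.
pub fn set_resharing_running_timeout(duration: Duration) {
    RESHARING_RUNNING_TIMEOUT_SECS.store(duration.as_secs(), Ordering::Relaxed);
}

pub fn resharing_running_timeout() -> Duration {
    Duration::from_secs(RESHARING_RUNNING_TIMEOUT_SECS.load(Ordering::Relaxed))
}
```

On x86 a relaxed store compiles to a plain `mov`, whereas a `SeqCst` swap becomes a locked `xchg`, which is where the cost difference comes from.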
volovyks left a comment
Wow, great improvement. And awesome test! Happy to merge it, not sure about those panic!()s.
```rust
tokio::time::sleep(Duration::from_secs(30)).await;
assert!(matches!(
    nodes.contract_state().await?,
    mpc_contract::ProtocolContractState::Resharing(_)
```
How much effort would it take to fix these two issues? It is still possible to be "stuck" in the resharing state.
It'll be easy to add the cancellation of resharing, but for parallel signing we need to consider things like: if we're kicking somebody, would it be a security concern to allow them to sign?
If we are kicking someone because of their misbehavior, the voting process will take several hours in the best-case scenario. The resharing protocol is relatively fast, so it does not matter much (5h vs 5h+1m).
The main reason we wanted parallel signing and resharing was possible downtime. Especially, downtime caused by failed resharing.
Added another test_resharing_running_participant_restart that restarts a node while the reshare is running. This surfaced other issues, such as resharing consuming messages from a previous reshare. To fix that, during every awaiting step each node creates its own separate token and broadcasts it with the ReadyMessage. Once all new participants are ready, these tokens are combined into a token unique to this specific reshare-running phase, so a new running phase gets an entirely different token. Reshare messages are then matched against these tokens; if the token does not match, the message is ignored.
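The token-combining scheme described above can be sketched as follows. The type aliases, the hash-based combiner, and the function names are assumptions for illustration — the PR may combine tokens differently — but the properties it needs are shown: every node derives the same combined token from the same ready set, and messages carrying a stale token are rejected.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical aliases for illustration only.
type ParticipantId = u32;
type Token = u64;

// Combine every participant's ready token into one token identifying this
// specific reshare-running phase. Iterating a BTreeMap visits entries in
// participant-id order, so the result is independent of arrival order and
// every node derives the same combined token.
fn combine_ready_tokens(ready_tokens: &BTreeMap<ParticipantId, Token>) -> Token {
    let mut hasher = DefaultHasher::new();
    for (id, token) in ready_tokens {
        id.hash(&mut hasher);
        token.hash(&mut hasher);
    }
    hasher.finish()
}

// Messages from an older reshare attempt carry a stale token and are ignored.
fn accept_reshare_message(current: Token, msg_token: Token) -> bool {
    current == msg_token
}
```

Because any node restarting the awaiting step generates a fresh per-node token, the combined token of the next running phase necessarily differs, which is what lets stale in-flight messages be filtered out.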
Force-pushed 6c44e66 to 2246049.
volovyks left a comment
Inventing a new approach for message filtering isn’t the best idea 🙂 But adopting a single universal approach for all protocols would be overkill.
jakmeier left a comment
Really nice! This is miles more robust than what we had before!
Long-term, keep in mind that this only solves crashes but not malicious behavior. I believe this approach would not work if, for example, one malicious node sends different my_token values to each peer.
We'll have to add it to the list of necessary improvements for byzantine fault-tolerance.
```rust
ready_tokens: std::iter::once((me, my_token)).collect(),
my_token,
// ready to broadcast immediately
broadcast_interval: Instant::now() - RESHARING_READY_BROADCAST_INTERVAL,
```
I think this can technically panic. Instant is an opaque type that may start counting at 0 when the program starts (I think I've observed this with WASM targets). If we reach this code in the first 10 seconds of the program, the subtraction will overflow and panic.
It probably happens to be fine for our use case. But API wise, a std::time::Instant should only be used to compare it to another Instant, not to some global clock.
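One panic-free way to express "ready to broadcast immediately" is to fall back to `now` when the subtraction would underflow; the constant name mirrors the snippet above, and the helper function is hypothetical.

```rust
use std::time::{Duration, Instant};

const RESHARING_READY_BROADCAST_INTERVAL: Duration = Duration::from_secs(10);

// Instead of `Instant::now() - INTERVAL`, which can panic when the platform's
// Instant starts near zero, use `checked_sub` and fall back to `now`. Either
// way the first broadcast-interval check passes immediately or almost
// immediately.
fn initial_broadcast_instant() -> Instant {
    let now = Instant::now();
    now.checked_sub(RESHARING_READY_BROADCAST_INTERVAL).unwrap_or(now)
}
```

An alternative that avoids Instant arithmetic entirely is storing `Option<Instant>` and treating `None` as "never broadcast yet".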
AFAIU, we probably cannot add BFT to the resharing step other than by restarting the reshare protocol with a different new participant set. To visualize this: for resharing we have an old participant set and a new participant set, where a subset of the old participants must also be in the new participants. For a participant to get their new key share, they have to communicate. That means they need to be online, which means all new participants have to be online to get their key share, and therefore all new participants have to be honest. We can maintain up to a threshold amount of old participants that must be honest and can be included in the new participant set. If we're in the resharing running step and a node behaves maliciously, we really just have to kick them out and restart the whole resharing process, since cait-sith does not allow us to adjust the participants in flight during resharing.
Thanks for the details @ChaoticTempest! I see, the resharing algorithm itself may not be suitable to be made BFT. But I believe we can still work on:
That said, I don't think making resharing BFT is a useful thing to work on right now. We can always fall back to manual intervention in a process that's only triggered manually anyway. As long as we know its limitations, I'm very happy with what we have once this PR is merged in.
This adds a fix to the resharing process: we now time out if it takes too long to progress. Progress is determined by the last action (i.e. receiving a new message, sending messages).
We default to a timeout of 5 minutes from the last action, which should give nodes plenty of time to restart if they are doing so.
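The "timeout from last action" idea can be sketched as below; the struct and method names are hypothetical and only illustrate that the clock resets on every observed action rather than measuring from the start of the phase.

```rust
use std::time::{Duration, Instant};

// Assumed default from the PR description: 5 minutes from the last action.
const RESHARING_RUNNING_TIMEOUT: Duration = Duration::from_secs(5 * 60);

// Illustrative state for the running phase.
struct RunningPhase {
    last_action: Instant,
}

impl RunningPhase {
    // Called whenever the protocol makes progress, e.g. a message is
    // received or sent; this resets the timeout window.
    fn record_action(&mut self) {
        self.last_action = Instant::now();
    }

    // True once no action has been observed for the whole timeout window,
    // at which point the node can abandon this attempt and retry.
    fn timed_out(&self) -> bool {
        self.last_action.elapsed() > RESHARING_RUNNING_TIMEOUT
    }
}
```

Measuring from the last action rather than the phase start means a slow-but-progressing reshare is never killed, while a genuinely stalled one is.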
Additionally, I made resharing a two-phase process: one phase is awaiting ready, the other is running the resharing.
In the awaiting step we broadcast every 10 seconds that we're ready to start. When all new participants, including us, are ready, we move to the next phase, which is running the resharing.
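The awaiting-ready phase described above can be sketched roughly as follows. The field and type names are assumptions for illustration; the two checks it captures are the ones from the description: re-broadcast readiness on a 10-second interval, and transition only once every new participant has announced readiness.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::time::{Duration, Instant};

const RESHARING_READY_BROADCAST_INTERVAL: Duration = Duration::from_secs(10);

// Hypothetical aliases for illustration only.
type ParticipantId = u32;
type Token = u64;

struct AwaitingReady {
    // The new participant set that must all be ready before running starts.
    new_participants: BTreeSet<ParticipantId>,
    // Tokens received so far, including our own.
    ready_tokens: BTreeMap<ParticipantId, Token>,
    // None means we have not broadcast yet, so broadcast immediately.
    last_broadcast: Option<Instant>,
}

impl AwaitingReady {
    // Re-broadcast our readiness every 10 seconds until everyone is ready.
    fn should_broadcast(&self) -> bool {
        match self.last_broadcast {
            None => true,
            Some(t) => t.elapsed() >= RESHARING_READY_BROADCAST_INTERVAL,
        }
    }

    // Transition to the running phase only once every new participant,
    // including ourselves, has announced readiness.
    fn all_ready(&self) -> bool {
        self.new_participants
            .iter()
            .all(|p| self.ready_tokens.contains_key(p))
    }
}
```

Because readiness is re-broadcast indefinitely, a node that restarts mid-phase simply rejoins the awaiting step and catches up, instead of forcing the whole reshare to be abandoned.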
So this fixes several cases where:
This also adds a new resharing test so we can see that it works properly when a node is offline and restarts.