fix: resharing retrying & timeout #568

Merged
ChaoticTempest merged 16 commits into develop from phuong/fix/resharing-timeout
Oct 22, 2025

Conversation

@ChaoticTempest
Contributor

This adds a fix to the resharing process: we now time out if it takes too long to progress. Progress is determined by the last action (i.e. receiving a new message, sending messages).

We default to a timeout of 5 minutes from the last action, which should give nodes plenty of time to restart if they are doing so.

Additionally, I made resharing a two-phase process: one phase awaits readiness, and the other runs the resharing.

In the awaiting step, we constantly broadcast that we're ready to start, every 10 seconds. When all new participants, including us, are ready, we go to the next phase, which runs the resharing.

So this fixes several cases where:

  • a node restarts and then has to restart the whole resharing again. This is resolved by going back to awaiting ready and then running the resharing.
  • nodes start the resharing protocol at different times and then time out at different times. The awaiting step fixes this, as we all now start roughly at the same time, within a 10-second interval of the readiness broadcast.

This also adds a new resharing test so we can see that it works properly when a node is offline and restarts.
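The two phases and the progress timeout described above can be sketched roughly like this. This is a hypothetical simplification: the type names, constants, and method signatures here are illustrative and do not match the node's actual code.

```rust
use std::collections::BTreeSet;
use std::time::{Duration, Instant};

type ParticipantId = u32;

// Assumed values from the PR description: 10s ready broadcasts, 5min timeout.
const READY_BROADCAST_INTERVAL: Duration = Duration::from_secs(10);
const RUNNING_TIMEOUT: Duration = Duration::from_secs(5 * 60);

enum ResharingPhase {
    /// Broadcast readiness every 10s until every new participant is ready.
    AwaitingReady {
        ready: BTreeSet<ParticipantId>,
        last_broadcast: Option<Instant>,
    },
    /// Run the resharing protocol; abort if no progress within the timeout.
    Running { last_action: Instant },
}

impl ResharingPhase {
    /// True when too much time has passed since the last progress-making
    /// action (message received or sent) of the running phase.
    fn timed_out(&self, now: Instant) -> bool {
        match self {
            ResharingPhase::Running { last_action } => {
                now.duration_since(*last_action) >= RUNNING_TIMEOUT
            }
            ResharingPhase::AwaitingReady { .. } => false,
        }
    }

    /// Transition to Running once all new participants reported ready.
    fn try_start(&mut self, new_participants: &BTreeSet<ParticipantId>, now: Instant) -> bool {
        let all_ready = match self {
            ResharingPhase::AwaitingReady { ready, .. } => ready.is_superset(new_participants),
            ResharingPhase::Running { .. } => return false,
        };
        if all_ready {
            *self = ResharingPhase::Running { last_action: now };
        }
        all_ready
    }
}
```

On timeout or node restart, the state machine would drop back to `AwaitingReady`, so a restarted node rejoins by broadcasting readiness rather than abandoning the resharing.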

Copilot AI left a comment

Pull Request Overview

This PR refactors the resharing protocol to address timeout and retry issues. The key improvement is splitting resharing into two phases: an "awaiting ready" phase where nodes broadcast readiness, and a "running" phase where the actual resharing protocol executes. This ensures all participants start roughly simultaneously and allows nodes to restart and rejoin without abandoning the entire resharing process.

Key Changes:

  • Introduces a two-phase resharing process with timeout management
  • Adds ResharingReadyMessage protocol for coordinating participant readiness
  • Implements configurable timeout (default 5 minutes) for the running phase
  • Restructures state management to support phase transitions and retries

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Summary per file:

  • integration-tests/tests/cases/mod.rs: Adds test verifying resharing completes when offline participant restarts
  • integration-tests/src/mpc_fixture/fixture_tasks.rs: Updates message logging to include ResharingReady messages
  • integration-tests/src/mpc_fixture/builder.rs: Removes unused config_rx parameter
  • chain-signatures/node/src/types.rs: Simplifies ReshareProtocol by removing redundant fields and refresh method
  • chain-signatures/node/src/protocol/test_setup.rs: Subscribes to resharing_ready channel in test setup
  • chain-signatures/node/src/protocol/state.rs: Adds new state types for two-phase resharing with timeout tracking
  • chain-signatures/node/src/protocol/mod.rs: Removes unused config parameter from progress method
  • chain-signatures/node/src/protocol/message/types.rs: Defines ResharingReadyMessage type and protocol variant
  • chain-signatures/node/src/protocol/message/sub.rs: Adds subscription support for ResharingReady messages
  • chain-signatures/node/src/protocol/message/mod.rs: Implements inbox handling and subscription for resharing readiness
  • chain-signatures/node/src/protocol/cryptography.rs: Implements two-phase resharing logic with timeout and retry mechanisms
  • chain-signatures/node/src/protocol/consensus.rs: Updates consensus handling to use new resharing state structure
  • chain-signatures/node/src/cli.rs: Subscribes to resharing_ready channel and removes config_rx parameter


}

pub fn set_resharing_running_timeout(duration: Duration) {
RESHARING_RUNNING_TIMEOUT_SECS.swap(duration.as_secs(), Ordering::SeqCst);
Copilot AI Oct 21, 2025
Using swap instead of store suggests the old value might be used, but it's discarded here. Use store instead for clarity since the return value is unused.

Suggested change
RESHARING_RUNNING_TIMEOUT_SECS.swap(duration.as_secs(), Ordering::SeqCst);
RESHARING_RUNNING_TIMEOUT_SECS.store(duration.as_secs(), Ordering::SeqCst);

Contributor

Super nitpicky but Copilot has a point about the semantics fitting better to store than swap. 😉

While at it, another detail is that with swap + Ordering::SeqCst, the compiler has to generate an atomic load with appropriate memory fences to ensure all potentially outstanding stores (or swaps) of other threads to RESHARING_RUNNING_TIMEOUT_SECS have been observed by this thread before this thread can perform its own swap. This can lead to quite expensive thread stalling and CPU cache invalidations.

Using store instead of swap would get rid of the load. I think that would already avoid CPU-level synchronization for x86 targets thanks to their relatively stronger memory model.

If you further change Ordering::SeqCst to Ordering::Relaxed, we should be able to entirely avoid expensive CPU synchronization instructions on all platforms. This is safe since we don't actually use this variable for multi-threaded synchronization. We just need every get and set to be atomic. A relaxed load / store is the best choice for this use case.

But again, this isn't important in the grand scheme of things. Feel free to change it if you want or keep it as is. It won't make a noticeable difference either way.
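Putting both suggestions together, the setter could look like the sketch below. The static and function names are taken from the quoted diff; the getter is a hypothetical companion added here for illustration, and the assumption is that the static is an AtomicU64.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Default of 5 minutes, per the PR description.
static RESHARING_RUNNING_TIMEOUT_SECS: AtomicU64 = AtomicU64::new(5 * 60);

pub fn set_resharing_running_timeout(duration: Duration) {
    // `store` instead of `swap`: the old value is unused. `Relaxed` is
    // enough because this variable is never used to synchronize other
    // memory; each load/store only needs to be individually atomic.
    RESHARING_RUNNING_TIMEOUT_SECS.store(duration.as_secs(), Ordering::Relaxed);
}

pub fn resharing_running_timeout() -> Duration {
    Duration::from_secs(RESHARING_RUNNING_TIMEOUT_SECS.load(Ordering::Relaxed))
}
```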

Contributor

@volovyks volovyks left a comment

Wow, great improvement. And awesome test! Happy to merge it, not sure about those panic!()s.

tokio::time::sleep(Duration::from_secs(30)).await;
assert!(matches!(
nodes.contract_state().await?,
mpc_contract::ProtocolContractState::Resharing(_)
Contributor

How much effort would it take to fix these two issues? It is still possible to be "stuck" in the resharing state.

Contributor Author

It'll be easy to add cancelling of resharing, but for parallel signing we need to consider some things, like: if we're kicking somebody out, would it be a security concern if we allow them to sign?

Contributor Author

added cancelling resharing here: #570

Contributor

@volovyks volovyks Oct 22, 2025

If we are kicking someone because of their misbehavior, the voting process will take several hours in the best-case scenario. The resharing protocol is relatively fast, so it does not matter much (5h vs 5h+1m).

The main reason we wanted parallel signing and resharing was possible downtime. Especially, downtime caused by failed resharing.

@ChaoticTempest
Contributor Author

Added another test, test_resharing_running_participant_restart, that restarts a participant in the middle of a running reshare. This surfaced some other issues, like resharing consuming messages from a previous reshare.

To fix that, each node during every awaiting step creates its own separate token to be broadcast out with the ReadyMessage. Once all new participants are ready, these tokens are combined to form a token unique to this very specific reshare running phase, such that a new reshare running phase gets an entirely different token.

Reshare messages are then matched against these tokens; if the token is not correct, the message is ignored.
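A minimal sketch of that token scheme. The names and the combination function here are hypothetical; the node may derive the run token differently.

```rust
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

type ParticipantId = u32;
type Token = u64;

/// Combine every participant's per-attempt token into one token unique to
/// this reshare running phase. Iterating a BTreeMap makes the result
/// independent of the order in which ReadyMessages arrived.
fn combine_tokens(tokens: &BTreeMap<ParticipantId, Token>) -> Token {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    for (id, token) in tokens {
        id.hash(&mut hasher);
        token.hash(&mut hasher);
    }
    hasher.finish()
}

/// Drop reshare messages stamped with a token from a different run.
fn accept_message(run_token: Token, msg_token: Token) -> bool {
    run_token == msg_token
}
```

Because a restarted node picks a fresh token in its next awaiting step, the combined token changes, so stale messages from the aborted run no longer match and get ignored.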

@ChaoticTempest ChaoticTempest force-pushed the phuong/fix/resharing-timeout branch from 6c44e66 to 2246049 Compare October 22, 2025 05:52
Contributor

@volovyks volovyks left a comment

Inventing a new approach for message filtering isn’t the best idea 🙂 But adopting a single universal approach for all protocols would be overkill.

@ChaoticTempest ChaoticTempest merged commit 7dca97a into develop Oct 22, 2025
3 checks passed
@ChaoticTempest ChaoticTempest deleted the phuong/fix/resharing-timeout branch October 22, 2025 17:34
Contributor

@jakmeier jakmeier left a comment

Really nice! This is miles more robust than what we had before!

Long-term, keep in mind that this only solves crashes but not malicious behavior. I believe this approach would not work if, for example, one malicious node sends different my_token values to each peer.

We'll have to add it to the list of necessary improvements for byzantine fault-tolerance.

ready_tokens: std::iter::once((me, my_token)).collect(),
my_token,
// ready to broadcast immediately
broadcast_interval: Instant::now() - RESHARING_READY_BROADCAST_INTERVAL,
Contributor

I think this can technically panic. Instant is an opaque type that may start counting at 0 when the program starts. (I think I've observed this with WASM targets.) If we get here in the first 10s of the program, the subtraction will underflow and panic.

It probably happens to be fine for our use case. But API wise, a std::time::Instant should only be used to compare it to another Instant, not to some global clock.
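One way to sidestep the subtraction entirely is to track the last broadcast as an Option<Instant>, where None means the first broadcast is due immediately. This is a hypothetical sketch of that alternative, not the node's actual fix.

```rust
use std::time::{Duration, Instant};

const RESHARING_READY_BROADCAST_INTERVAL: Duration = Duration::from_secs(10);

struct ReadyBroadcaster {
    // None means we have never broadcast, so the first check fires
    // immediately. This avoids `Instant::now() - INTERVAL`, which can
    // underflow and panic if the Instant clock started counting less
    // than 10s ago.
    last_broadcast: Option<Instant>,
}

impl ReadyBroadcaster {
    fn new() -> Self {
        ReadyBroadcaster { last_broadcast: None }
    }

    /// Returns true (and records the broadcast) when a ready message is due.
    fn should_broadcast(&mut self, now: Instant) -> bool {
        let due = match self.last_broadcast {
            None => true,
            Some(last) => now.duration_since(last) >= RESHARING_READY_BROADCAST_INTERVAL,
        };
        if due {
            self.last_broadcast = Some(now);
        }
        due
    }
}
```

This keeps the Instant API usage to comparisons between two Instants, which is what the type is designed for.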

@ChaoticTempest
Contributor Author

Long-term, keep in mind that this only solves crashes but not malicious behavior. I believe this approach would not work if, for example, one malicious node sends different my_token values to each peer.

We'll have to add it to the list of necessary improvements for byzantine fault-tolerance.

AFAIU, we probably cannot add BFT to the resharing step, aside from restarting the reshare protocol with a different new participant set.

To visualize this a bit: for resharing, we have an old participant set and a new participant set, where a subset of the old participants must also be in the new participants. For a participant to get their new key share, they have to communicate. That means they need to be online, which means all new participants have to be online to get their key share. So new participants have to be honest. We can maintain up to a threshold number of old participants that must be honest and can be included in the new participant set.

If we're in the resharing running step and a node behaves maliciously, we really just have to kick them out and then restart the whole resharing process again. This is due to cait-sith not allowing us to adjust the participants in flight during resharing.

@jakmeier
Contributor

Thanks for the details @ChaoticTempest !

I see, the resharing algorithm itself may not be suitable to be made BFT. But I believe we can still work on:

  • Detecting when a node behaves dishonestly.
  • Ensuring bad participants can't disrupt the entire process indefinitely.

That said, I don't think making resharing BFT is a useful thing to work on right now. We can always fall back to manual intervention in a process that's only triggered manually anyways. As long as we know the limitations of it, I'm very happy with what we have now with this PR merged in.
