fix: resharing retrying & timeout#568
Conversation
Pull Request Overview
This PR refactors the resharing protocol to address timeout and retry issues. The key improvement is splitting resharing into two phases: an "awaiting ready" phase where nodes broadcast readiness, and a "running" phase where the actual resharing protocol executes. This ensures all participants start roughly simultaneously and allows nodes to restart and rejoin without abandoning the entire resharing process.
Key Changes:
- Introduces a two-phase resharing process with timeout management
- Adds `ResharingReadyMessage` protocol for coordinating participant readiness
- Implements configurable timeout (default 5 minutes) for the running phase
- Restructures state management to support phase transitions and retries
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| integration-tests/tests/cases/mod.rs | Adds test verifying resharing completes when offline participant restarts |
| integration-tests/src/mpc_fixture/fixture_tasks.rs | Updates message logging to include ResharingReady messages |
| integration-tests/src/mpc_fixture/builder.rs | Removes unused config_rx parameter |
| chain-signatures/node/src/types.rs | Simplifies ReshareProtocol by removing redundant fields and refresh method |
| chain-signatures/node/src/protocol/test_setup.rs | Subscribes to resharing_ready channel in test setup |
| chain-signatures/node/src/protocol/state.rs | Adds new state types for two-phase resharing with timeout tracking |
| chain-signatures/node/src/protocol/mod.rs | Removes unused config parameter from progress method |
| chain-signatures/node/src/protocol/message/types.rs | Defines ResharingReadyMessage type and protocol variant |
| chain-signatures/node/src/protocol/message/sub.rs | Adds subscription support for ResharingReady messages |
| chain-signatures/node/src/protocol/message/mod.rs | Implements inbox handling and subscription for resharing readiness |
| chain-signatures/node/src/protocol/cryptography.rs | Implements two-phase resharing logic with timeout and retry mechanisms |
| chain-signatures/node/src/protocol/consensus.rs | Updates consensus handling to use new resharing state structure |
| chain-signatures/node/src/cli.rs | Subscribes to resharing_ready channel and removes config_rx parameter |
```rust
pub fn set_resharing_running_timeout(duration: Duration) {
    RESHARING_RUNNING_TIMEOUT_SECS.swap(duration.as_secs(), Ordering::SeqCst);
}
```
Using swap instead of store suggests the old value might be used, but it's discarded here. Use store instead for clarity since the return value is unused.
```diff
- RESHARING_RUNNING_TIMEOUT_SECS.swap(duration.as_secs(), Ordering::SeqCst);
+ RESHARING_RUNNING_TIMEOUT_SECS.store(duration.as_secs(), Ordering::SeqCst);
```
Super nitpicky but Copilot has a point about the semantics fitting better to store than swap. 😉
While at it, another detail is that with swap + Ordering::SeqCst, the compiler has to generate an atomic load with appropriate memory fences to ensure all potentially outstanding stores (or swaps) of other threads to RESHARING_RUNNING_TIMEOUT_SECS have been observed by this thread before this thread can perform its own swap. This can lead to quite expensive thread stalling and CPU cache invalidations.
Using store instead of swap would get rid of the load. I think that would already avoid CPU-level synchronization for x86 targets thanks to their relatively stronger memory model.
If you further change `Ordering::SeqCst` to `Ordering::Relaxed`, we should be able to entirely avoid expensive CPU synchronization instructions on all platforms. This is safe since we don't actually use this variable for multi-threaded synchronization. We just need every get and set to be atomic. Relaxed load / store is the best choice for this use case.
But again, this isn't important in the grand scheme of things. Feel free to change it if you want or keep it as is. It won't make a noticeable difference either way.
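To make the suggestion concrete, here is a minimal sketch of the relaxed-ordering variant. The static, its initial value of 300 seconds, and the getter are assumptions for illustration; only the `store`/`Relaxed` pattern is the point.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Hypothetical stand-in for the node's timeout global (assumed default: 300s).
static RESHARING_RUNNING_TIMEOUT_SECS: AtomicU64 = AtomicU64::new(300);

// `store` with `Relaxed` suffices: the value is a standalone config knob and
// is not used to synchronize any other memory between threads. We only need
// each individual load and store to be atomic.
pub fn set_resharing_running_timeout(duration: Duration) {
    RESHARING_RUNNING_TIMEOUT_SECS.store(duration.as_secs(), Ordering::Relaxed);
}

pub fn resharing_running_timeout() -> Duration {
    Duration::from_secs(RESHARING_RUNNING_TIMEOUT_SECS.load(Ordering::Relaxed))
}
```

On x86 a relaxed store compiles to a plain `mov`, whereas a `SeqCst` swap becomes a locked `xchg`, which is where the cost difference comes from.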
volovyks left a comment
Wow, great improvement. And awesome test! Happy to merge it, not sure about those panic!()s.
```rust
tokio::time::sleep(Duration::from_secs(30)).await;
assert!(matches!(
    nodes.contract_state().await?,
    mpc_contract::ProtocolContractState::Resharing(_)
```
How much effort would it take to fix these two issues? It is still possible to be "stuck" in the resharing state.
It'll be easy to add the cancellation of resharing, but for parallel signing we need to consider things like: if we're kicking somebody, would it be a security concern to allow them to sign?
If we are kicking someone because of their misbehavior, the voting process will take several hours in the best-case scenario. The resharing protocol is relatively fast, so it does not matter much (5h vs 5h+1m).
The main reason we wanted parallel signing and resharing was possible downtime. Especially, downtime caused by failed resharing.
Added another test_resharing_running_participant_restart that restarts a node while the reshare is running. This surfaced other issues, such as resharing consuming messages from a previous reshare. To fix that, during every awaiting step each node creates its own separate token and broadcasts it with the ReadyMessage. Once all new participants are ready, these tokens are combined into a token unique to this specific reshare-running phase, so a new running phase gets an entirely different token. Reshare messages are then matched against these tokens; if the token does not match, the message is ignored.
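The token-combining scheme described above can be sketched as follows. The type aliases, the hash-based combiner, and the function names are assumptions for illustration — the PR may combine tokens differently — but the properties it needs are shown: every node derives the same combined token from the same ready set, and messages carrying a stale token are rejected.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical aliases for illustration only.
type ParticipantId = u32;
type Token = u64;

// Combine every participant's ready token into one token identifying this
// specific reshare-running phase. Iterating a BTreeMap visits entries in
// participant-id order, so the result is independent of arrival order and
// every node derives the same combined token.
fn combine_ready_tokens(ready_tokens: &BTreeMap<ParticipantId, Token>) -> Token {
    let mut hasher = DefaultHasher::new();
    for (id, token) in ready_tokens {
        id.hash(&mut hasher);
        token.hash(&mut hasher);
    }
    hasher.finish()
}

// Messages from an older reshare attempt carry a stale token and are ignored.
fn accept_reshare_message(current: Token, msg_token: Token) -> bool {
    current == msg_token
}
```

Because any node restarting the awaiting step generates a fresh per-node token, the combined token of the next running phase necessarily differs, which is what lets stale in-flight messages be filtered out.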
Force-pushed 6c44e66 to 2246049.
volovyks left a comment
Inventing a new approach for message filtering isn’t the best idea 🙂 But adopting a single universal approach for all protocols would be overkill.
jakmeier left a comment
Really nice! This is miles more robust than what we had before!
Long-term, keep in mind that this only solves crashes but not malicious behavior. I believe this approach would not work if, for example, one malicious node sends different my_token values to each peer.
We'll have to add it to the list of necessary improvements for byzantine fault-tolerance.
```rust
ready_tokens: std::iter::once((me, my_token)).collect(),
my_token,
// ready to broadcast immediately
broadcast_interval: Instant::now() - RESHARING_READY_BROADCAST_INTERVAL,
```
I think this can technically panic. Instant is an opaque type that may start counting at 0 when the program starts (I think I've observed this with WASM targets). If we reach this code in the first 10 seconds of the program, the subtraction will overflow and panic.
It probably happens to be fine for our use case. But API wise, a std::time::Instant should only be used to compare it to another Instant, not to some global clock.
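One panic-free way to express "ready to broadcast immediately" is to fall back to `now` when the subtraction would underflow; the constant name mirrors the snippet above, and the helper function is hypothetical.

```rust
use std::time::{Duration, Instant};

const RESHARING_READY_BROADCAST_INTERVAL: Duration = Duration::from_secs(10);

// Instead of `Instant::now() - INTERVAL`, which can panic when the platform's
// Instant starts near zero, use `checked_sub` and fall back to `now`. Either
// way the first broadcast-interval check passes immediately or almost
// immediately.
fn initial_broadcast_instant() -> Instant {
    let now = Instant::now();
    now.checked_sub(RESHARING_READY_BROADCAST_INTERVAL).unwrap_or(now)
}
```

An alternative that avoids Instant arithmetic entirely is storing `Option<Instant>` and treating `None` as "never broadcast yet".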
AFAIU, we probably cannot add BFT to the resharing step other than by restarting the reshare protocol with a different new participant set. To visualize this: for resharing we have an old participant set and a new participant set, where a subset of the old participants must also be in the new participants. For a participant to get their new key share, they have to communicate. That means they need to be online, which means all new participants have to be online to get their key share, and therefore all new participants have to be honest. We can maintain up to a threshold amount of old participants that must be honest and can be included in the new participant set. If we're in the resharing running step and a node behaves maliciously, we really just have to kick them out and restart the whole resharing process, since cait-sith does not allow us to adjust the participants in flight during resharing.
Thanks for the details @ChaoticTempest! I see, the resharing algorithm itself may not be suitable to be made BFT. But I believe we can still work on:
That said, I don't think making resharing BFT is a useful thing to work on right now. We can always fall back to manual intervention in a process that's only triggered manually anyway. As long as we know its limitations, I'm very happy with what we have once this PR is merged in.
This adds a fix to the resharing process: we now time out if it takes too long to progress. Progress is determined by the last action (i.e. receiving a new message, sending messages).
We default to a timeout of 5 minutes from the last action, which should give nodes plenty of time to restart if they are doing so.
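The "timeout from last action" idea can be sketched as below; the struct and method names are hypothetical and only illustrate that the clock resets on every observed action rather than measuring from the start of the phase.

```rust
use std::time::{Duration, Instant};

// Assumed default from the PR description: 5 minutes from the last action.
const RESHARING_RUNNING_TIMEOUT: Duration = Duration::from_secs(5 * 60);

// Illustrative state for the running phase.
struct RunningPhase {
    last_action: Instant,
}

impl RunningPhase {
    // Called whenever the protocol makes progress, e.g. a message is
    // received or sent; this resets the timeout window.
    fn record_action(&mut self) {
        self.last_action = Instant::now();
    }

    // True once no action has been observed for the whole timeout window,
    // at which point the node can abandon this attempt and retry.
    fn timed_out(&self) -> bool {
        self.last_action.elapsed() > RESHARING_RUNNING_TIMEOUT
    }
}
```

Measuring from the last action rather than the phase start means a slow-but-progressing reshare is never killed, while a genuinely stalled one is.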
Additionally, I made resharing a two-phase process: one phase is awaiting ready, the other is running the resharing.
In the awaiting step we broadcast every 10 seconds that we're ready to start. When all new participants, including us, are ready, we move to the next phase, which is running the resharing.
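The awaiting-ready phase described above can be sketched roughly as follows. The field and type names are assumptions for illustration; the two checks it captures are the ones from the description: re-broadcast readiness on a 10-second interval, and transition only once every new participant has announced readiness.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::time::{Duration, Instant};

const RESHARING_READY_BROADCAST_INTERVAL: Duration = Duration::from_secs(10);

// Hypothetical aliases for illustration only.
type ParticipantId = u32;
type Token = u64;

struct AwaitingReady {
    // The new participant set that must all be ready before running starts.
    new_participants: BTreeSet<ParticipantId>,
    // Tokens received so far, including our own.
    ready_tokens: BTreeMap<ParticipantId, Token>,
    // None means we have not broadcast yet, so broadcast immediately.
    last_broadcast: Option<Instant>,
}

impl AwaitingReady {
    // Re-broadcast our readiness every 10 seconds until everyone is ready.
    fn should_broadcast(&self) -> bool {
        match self.last_broadcast {
            None => true,
            Some(t) => t.elapsed() >= RESHARING_READY_BROADCAST_INTERVAL,
        }
    }

    // Transition to the running phase only once every new participant,
    // including ourselves, has announced readiness.
    fn all_ready(&self) -> bool {
        self.new_participants
            .iter()
            .all(|p| self.ready_tokens.contains_key(p))
    }
}
```

Because readiness is re-broadcast indefinitely, a node that restarts mid-phase simply rejoins the awaiting step and catches up, instead of forcing the whole reshare to be abandoned.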
So this fixes several cases where:
This also adds a new resharing test so we can see that it works properly when a node is offline and restarts.