Keepalive with Fibonacci Backoff, fault tolerant + randomised + concurrent orchestrator signing logic and more #18
Conversation
Looking through the PR now! This is amazing already, so a big thank-you again for this PR :D Also, for testing crashes and things like that, do you know https://crates.io/crates/fail? Maybe it would be a good idea to create some tests with it (with #20)
Yeah, that's definitely something we should look into. I can do some research and open another PR once I find a good solution. Then we can even include the e2e and the normal tests in the CI pipeline.
beautiful!
```diff
-async fn is_alive(_params: Params<'static>, _context: Arc<NodeState>) -> String {
-    "hello".to_string()
+async fn is_alive(params: Params<'static>, _context: Arc<NodeState>) -> RpcResult<u64> {
+    Ok(params.parse::<[u64; 1]>()?[0].clone())
```
is this actually best practice in order to ensure it's not a cached response or something?
Yup, a lot of APIs use the current timestamp in ping keepalive requests, so it can be considered a standard. In this case it's more of a sanity check, but just in case the node has a weird caching reverse proxy set up, this is good to have.
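For example, the check on the orchestrator side could look roughly like this (a minimal sketch; `ping_is_fresh` is a made-up name and the `call` closure stands in for the project's actual RPC request to `is_alive`):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Send the current unix time as the ping payload and require that the node
// echoes it back, so a cached or replayed response cannot pass as a live one.
async fn ping_is_fresh<F, Fut>(call: F) -> bool
where
    F: FnOnce(u64) -> Fut,
    Fut: std::future::Future<Output = Result<u64, Box<dyn std::error::Error>>>,
{
    let nonce = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the unix epoch")
        .as_secs();
    match call(nonce).await {
        Ok(echoed) => echoed == nonce,
        Err(_) => false,
    }
}
```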
```rust
.iter()
.map(|(_, member)| async {
    let rpc_ctx =
        RpcCtx::new(Some("2.0"), None, Some(member.address.clone()), None, None);
```
note that the default timeout atm is 10s, which is quite a lot :D
```rust
warn!("Round 1 error with {}, marking as disconnected and retrying round 1: {rpc_error}", member.address);
let mut ms_w = member_status.write().unwrap();
ms_w.mark_as_disconnected(member_id);
continue 'retry;
```
we actually don't have to restart the whole process at this point; we can also ask more people (including the disconnected peer perhaps), but I think this is good enough and simpler!
You're right! We can definitely do that. Right now it's fine, because even with like 5-10 MPC nodes it will be fast enough, but when there are more nodes, the requests that already succeeded become redundant.
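As a rough sketch of that alternative (hypothetical names, not the PR's code): fan the round 1 requests out to all candidates at once and keep the first `threshold` successes, so a single failure doesn't force a restart:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Poll all requests concurrently and stop as soon as `threshold` commitments
// have arrived; failures and stragglers are simply ignored.
async fn gather_commitments<F, Fut, C>(
    candidates: Vec<String>,
    threshold: usize,
    request_commitment: F,
) -> Option<Vec<C>>
where
    F: Fn(String) -> Fut,
    Fut: std::future::Future<Output = Result<C, String>>,
{
    let mut pending: FuturesUnordered<Fut> =
        candidates.into_iter().map(request_commitment).collect();
    let mut commitments = Vec::with_capacity(threshold);
    while let Some(result) = pending.next().await {
        if let Ok(commitment) = result {
            commitments.push(commitment);
            if commitments.len() == threshold {
                return Some(commitments);
            }
        }
    }
    None // not enough available signers
}
```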
```rust
};

let response: bitcoincore_rpc::jsonrpc::Response = serde_json::from_str(&resp)?;
let resp: Round1Response = response.result()?;
```
a deserialization error here will prevent the process from continuing. I think it'd be nice to treat this as a disconnected member as well (perhaps with a different log msg)
You are correct! Also, maybe we could use custom structs for the response, both here and for the other MPC node responses.
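For illustration, that could look something like this, reusing the names from the snippets above (`member_status`, `member_id`, the `'retry` label); a sketch, not the PR's actual code:

```rust
// Treat a malformed round 1 response like an unresponsive member instead of
// bubbling the error up and aborting the whole signing attempt.
let parsed = serde_json::from_str::<bitcoincore_rpc::jsonrpc::Response>(&resp)
    .ok()
    .and_then(|response| response.result::<Round1Response>().ok());
let resp: Round1Response = match parsed {
    Some(round1_response) => round1_response,
    None => {
        warn!("Round 1: malformed response from {}, marking as disconnected", member.address);
        member_status.write().unwrap().mark_as_disconnected(member_id);
        continue 'retry;
    }
};
```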
```rust
};

let response: bitcoincore_rpc::jsonrpc::Response = serde_json::from_str(&resp)?;
let round2_response: Round2Response = response.result()?;
```
same issue here, I think we should retry instead of returning an error
```rust
// return the signed transaction
return Ok(BobResponse {
    unlocked_tx: transaction,
```
note that we could return the witness instead and have bob include it in the transaction themselves (it would be a smaller response)
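A sketch of that shape (hypothetical struct name; assumes the `bitcoin` crate's `serde` feature for `Witness`):

```rust
use bitcoin::Witness;
use serde::{Deserialize, Serialize};

// Bob already holds the unsigned transaction, so the witness for the signed
// input is all he needs to finalize it locally, and the payload stays small.
#[derive(Serialize, Deserialize)]
pub struct BobWitnessResponse {
    pub witness: Witness,
}
```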
```diff
@@ -36,12 +40,235 @@ use super::node::{Round2Request, Round2Response};
 // Orchestration logic
 //
 
+type RpcOrchestratorContext = Arc<(Orchestrator, Arc<RwLock<MemberStatusState>>)>;
```
maybe cleaner if `Orchestrator` has a field with this RwLock inside
Tried to do that, but had some issues with jsonrpsee's context automatically wrapping the struct passed to it in an Arc. I tried to avoid double Arcs, but it wasn't really possible, since I already needed an Arc of the RwLock before passing it to the `RpcModule`, and the `RpcModule` has no function to fetch a clone of the Arc. But hey, we have a double Arc anyway, so we can definitely do what you suggested.
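For reference, the suggested layout could look like this (a sketch reusing the PR's `CommitteeConfig` and `MemberStatusState` types; field names are illustrative):

```rust
use std::sync::{Arc, RwLock};

// The orchestrator owns the member-status lock directly, so the jsonrpsee
// context collapses from Arc<(Orchestrator, Arc<RwLock<...>>)> to a single
// Arc<Orchestrator> (jsonrpsee wraps the context in an Arc for us).
pub struct Orchestrator {
    pub committee_cfg: CommitteeConfig,
    pub member_status: Arc<RwLock<MemberStatusState>>,
}

type RpcOrchestratorContext = Arc<Orchestrator>;
```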
```rust
// to wait for a read lock every time an rpc handler needs to access the config.
// This way, handlers will only wait for read locks when they need the status of the members.
pub struct MemberStatusState {
    pub key_to_addr: HashMap<frost_secp256k1_tr::Identifier, String>,
```
note that we now have this information in two different places; it already exists in `Orchestrator.committee_cfg.members`
Yup. It would probably be best if we moved the status logic into the Orchestrator struct so it would always have access to that map, but I figured this was better for separating out this auxiliary functionality. The map is not going to grow large anyway.
```rust
.key_to_addr
.iter()
.map(|(key, addr)| {
    let status = r_lock.status.get(key).unwrap().clone();
```
maybe cleaner:

```rust
let matcher = |s: &MemberStatus| {
    matches!(s, MemberStatus::Online)
        || matches!(s, MemberStatus::Disconnected((r, _)) if *r < unixtime())
};
r_lock.status.values().filter(|s| matcher(s)).collect_vec()
```
We can filter first using the cleaner `matches!` macro, but we would need the map for the next part of the code.
```rust
let member_status = state_w.status.get_mut(key).unwrap();
let last_retries = match old_status {
    MemberStatus::Disconnected((_, last_retry_number)) => *last_retry_number,
    _ => 0,
```
note that it is possible that the node was marked as offline in the meantime (due to a failure to respond in round 2 of a signature), so here we might want to have:

```rust
match old_status {
    MemberStatus::Disconnected((_, last_retry_number)) => *last_retry_number,
    MemberStatus::Online => 0,
    MemberStatus::Offline => continue,
}
```
Hi! This is a pretty interesting project. I read through the code and decided to help implement some stuff :)
Issues

- `generate-committee` fails if the provided directory doesn't exist; it now creates it by default
- Low-level request logs were moved from `info` to `debug`. Imho, info level logs shouldn't be spammed by low level request messages

Implemented Features

- Added a `status` method to the orchestrator jsonrpc which returns the alive/dead MPC nodes (using the key identifier, not the address, which could be private to the orchestrator)
- The `circom` and `snarkjs` binaries are required to do stuff, so I changed the README to be more clear :)

Let's get into the exciting details!
MemberStatus

A new `MemberStatus` enum was implemented which is used to classify the state of an MPC member inside the orchestrator:

- `Online` - The node is online and responsive
- `Disconnected` - The node is unresponsive, but there will be reconnection attempts with fibonacci delays
- `Offline` - The node is unresponsive and reconnection will not be attempted anymore. This can happen either because we tried reconnecting too many times in the `Disconnected` state, or because the node generated a round 1 signing commitment but didn't manage to complete round 2, meaning it is very unreliable and may be running a different version than the other nodes.
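For reference, a sketch of what this enum could look like (the `Disconnected` payload is inferred from the review snippets above, where it carries a retry timestamp and a retry counter; the exact types may differ):

```rust
pub enum MemberStatus {
    Online,
    // (unix timestamp of the next reconnection attempt, retries so far)
    Disconnected((u64, u64)),
    Offline,
}
```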
The status of each node will be held inside a `RwLock<MemberStatusState>` structure which is passed inside the `jsonrpsee` context.

status jsonrpc method

Pretty self-explanatory, this is what the status method returns:
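Something along these lines (a hypothetical shape; the real response keys members by their key identifier as described above):

```json
{
  "<member key identifier>": "Online",
  "<another key identifier>": "Disconnected",
  "<a third key identifier>": "Offline"
}
```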
Fibonacci backoff keepalive

When the orchestrator starts, it will ping the newly implemented `ping` endpoint of each MPC member and determine if it's alive or not. It will keep on querying the ping endpoint every `KEEPALIVE_WAIT_SECONDS` seconds (defined in `constants.rs`). It will only ping `Online` or `Disconnected` nodes and update the status accordingly. The fibonacci backoff means that if a node doesn't respond (is in the `Disconnected` state), it will try again in 5, 8, 13, 21, etc. seconds until it reaches `KEEPALIVE_MAX_RETRIES` retries, at which point it will classify the node as `Offline`.
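A small illustrative helper for that schedule (not the PR's actual code), seeded so the first retry waits 5 seconds as described:

```rust
// Keepalive delay in seconds for the given retry count: 5, 8, 13, 21, ...
// (Fibonacci, starting the sequence from the pair 3 and 5).
fn fibonacci_delay_secs(retry: u32) -> u64 {
    let (mut a, mut b) = (3u64, 5u64);
    for _ in 0..retry {
        (a, b) = (b, a + b);
    }
    b
}
```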
New concurrent fault-tolerant signing logic

This is how the new logic works and is pretty fault tolerant:

1. It takes the `Online` nodes and shuffles them. If there aren't enough of them, it returns a `not enough available signers` error.
2. It concurrently runs the signing rounds with the first `n` nodes required for the threshold.
3. If a node fails round 1, it is marked as `Disconnected` and jumps back to step 1. This way, it will try again with a new set of nodes.
4. If a node fails round 2, it is marked as `Offline` and jumps back to step 1.

Testing
I have tested every new feature, but I highly encourage you to test it out more!
In order to test the fault tolerance, I created some new committee keys and spun up my own orchestrator and MPC nodes with a new `ZKBITCOIN` address. These are the tests I have performed and the results:

Test 1
I tried "using" a zkapp while only 1/3 MPC nodes were online. I received the expected `not enough available signers` error.
Test 2

I spun up 2/3 nodes, but with a catch: one of the running nodes was rigged to panic when asked for round 2.

The orchestrator didn't crash and just returned the same error as in test 1. This is because it tried doing the signing logic, but failed on step 2 and classified the failing node as `Offline`. It then tried again, but only saw 1/3 available nodes and returned the error.

Test 3 - the most important test
I also did test 2 but with 3/3 nodes running, and only one of them rigged to crash when asked for a round 1 commitment. What happened is that it actually did randomly select the faulty node, but when it didn't receive a commitment from it, it classified that node as `Disconnected`, then tried again with the other 2 nodes, successfully generated a signature, and submitted the spend transaction. This is great fault tolerance!

Test 4
I spun up 2/3 nodes with no catches. This should work in this scenario, and it did.
Here is my `ZKBITCOIN` address: https://mempool.space/testnet/address/tb1przdyzca6zxlykmas4tvdum8qtac3m0ppsfm9p8akfckqkwnw07xs2u9cwf

There are 4 transactions: 2 deploys, one spent on test 3 and one spent on test 4.
If you have something to add or any suggestions on what I implemented, I would love to hear them!