Keepalive with Fibonacci Backoff, fault tolerant + randomised + concurrent orchestrator signing logic and more #18
Conversation
Looking through the PR now! This is amazing already, so a big thank-you again for this PR :D Also, for testing crashes and things like that, do you know https://crates.io/crates/fail? Maybe it would be a good idea to create some tests with it (with #20)
Yeah, that's definitely something we should look into. I can do some research and open another PR once I find a good solution. Then we can even include the e2e and the normal tests in the CI pipeline.
beautiful!
```diff
-async fn is_alive(_params: Params<'static>, _context: Arc<NodeState>) -> String {
-    "hello".to_string()
+async fn is_alive(params: Params<'static>, _context: Arc<NodeState>) -> RpcResult<u64> {
+    Ok(params.parse::<[u64; 1]>()?[0].clone())
```
is this actually best practice in order to ensure it's not a cached response or something?
Yup, a lot of APIs use the current timestamp in ping keepalive requests, so it can be considered a standard. In this case it's more of a sanity check, but just in case the node has a weird caching reverse proxy set up, this is good to have.
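For example, the check on the orchestrator side could look roughly like this (a minimal sketch; `ping_is_fresh` is a made-up name and the `call` closure stands in for the project's actual RPC request to `is_alive`):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Send the current unix time as the ping payload and require that the node
// echoes it back, so a cached or replayed response cannot pass as a live one.
async fn ping_is_fresh<F, Fut>(call: F) -> bool
where
    F: FnOnce(u64) -> Fut,
    Fut: std::future::Future<Output = Result<u64, Box<dyn std::error::Error>>>,
{
    let nonce = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the unix epoch")
        .as_secs();
    match call(nonce).await {
        Ok(echoed) => echoed == nonce,
        Err(_) => false,
    }
}
```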
```rust
.iter()
.map(|(_, member)| async {
    let rpc_ctx =
        RpcCtx::new(Some("2.0"), None, Some(member.address.clone()), None, None);
```
note that the default timeout atm is 10s, which is quite a lot :D
```rust
warn!("Round 1 error with {}, marking as disconnected and retrying round 1: {rpc_error}", member.address);
let mut ms_w = member_status.write().unwrap();
ms_w.mark_as_disconnected(member_id);
continue 'retry;
```
we actually don't have to restart the whole process at this point; we can also ask more people (including the disconnected peer perhaps), but I think this is good enough and simpler!
You're right! We can definitely do that. Right now it's fine, because even with like 5-10 MPC nodes it will be fast enough, but when there are more nodes, the requests that already succeeded become redundant.
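As a rough sketch of that alternative (hypothetical names, not the PR's code): fan the round 1 requests out to all candidates at once and keep the first `threshold` successes, so a single failure doesn't force a restart:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Poll all requests concurrently and stop as soon as `threshold` commitments
// have arrived; failures and stragglers are simply ignored.
async fn gather_commitments<F, Fut, C>(
    candidates: Vec<String>,
    threshold: usize,
    request_commitment: F,
) -> Option<Vec<C>>
where
    F: Fn(String) -> Fut,
    Fut: std::future::Future<Output = Result<C, String>>,
{
    let mut pending: FuturesUnordered<Fut> =
        candidates.into_iter().map(request_commitment).collect();
    let mut commitments = Vec::with_capacity(threshold);
    while let Some(result) = pending.next().await {
        if let Ok(commitment) = result {
            commitments.push(commitment);
            if commitments.len() == threshold {
                return Some(commitments);
            }
        }
    }
    None // not enough available signers
}
```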
```rust
};

let response: bitcoincore_rpc::jsonrpc::Response = serde_json::from_str(&resp)?;
let resp: Round1Response = response.result()?;
```
a deserialization error here will prevent the process from continuing. I think it'd be nice to treat this as a disconnected member as well (perhaps with a different log msg)
You are correct! Also, maybe we could use custom structs for the response, both here and for the other MPC node responses.
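For illustration, that could look something like this, reusing the names from the snippets above (`member_status`, `member_id`, the `'retry` label); a sketch, not the PR's actual code:

```rust
// Treat a malformed round 1 response like an unresponsive member instead of
// bubbling the error up and aborting the whole signing attempt.
let parsed = serde_json::from_str::<bitcoincore_rpc::jsonrpc::Response>(&resp)
    .ok()
    .and_then(|response| response.result::<Round1Response>().ok());
let resp: Round1Response = match parsed {
    Some(round1_response) => round1_response,
    None => {
        warn!("Round 1: malformed response from {}, marking as disconnected", member.address);
        member_status.write().unwrap().mark_as_disconnected(member_id);
        continue 'retry;
    }
};
```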
```rust
};

let response: bitcoincore_rpc::jsonrpc::Response = serde_json::from_str(&resp)?;
let round2_response: Round2Response = response.result()?;
```
same issue here, I think we should retry instead of returning an error
```rust
// return the signed transaction
return Ok(BobResponse {
    unlocked_tx: transaction,
```
note that we could return the witness instead and have bob include it in the transaction themselves (it would be a smaller response)
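A sketch of that shape (hypothetical struct name; assumes the `bitcoin` crate's `serde` feature for `Witness`):

```rust
use bitcoin::Witness;
use serde::{Deserialize, Serialize};

// Bob already holds the unsigned transaction, so the witness for the signed
// input is all he needs to finalize it locally, and the payload stays small.
#[derive(Serialize, Deserialize)]
pub struct BobWitnessResponse {
    pub witness: Witness,
}
```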
```diff
@@ -36,12 +40,235 @@ use super::node::{Round2Request, Round2Response};
 // Orchestration logic
 //
 
+type RpcOrchestratorContext = Arc<(Orchestrator, Arc<RwLock<MemberStatusState>>)>;
```
maybe cleaner if `Orchestrator` has a field with this RwLock inside
Tried to do that, but had some issues with jsonrpsee's context automatically wrapping the struct passed to it in an Arc. I tried to avoid double Arcs, but it wasn't really possible, since I already needed an Arc of the RwLock before passing it to the `RpcModule`, and the `RpcModule` has no function to fetch a clone of the Arc. But hey, we have a double Arc anyway, so we can definitely do what you suggested.
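For reference, the suggested layout could look like this (a sketch reusing the PR's `CommitteeConfig` and `MemberStatusState` types; field names are illustrative):

```rust
use std::sync::{Arc, RwLock};

// The orchestrator owns the member-status lock directly, so the jsonrpsee
// context collapses from Arc<(Orchestrator, Arc<RwLock<...>>)> to a single
// Arc<Orchestrator> (jsonrpsee wraps the context in an Arc for us).
pub struct Orchestrator {
    pub committee_cfg: CommitteeConfig,
    pub member_status: Arc<RwLock<MemberStatusState>>,
}

type RpcOrchestratorContext = Arc<Orchestrator>;
```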
```rust
// to wait for a read lock every time an rpc handler needs to access the config.
// This way, handlers will only wait for read locks when they need the status of the members.
pub struct MemberStatusState {
    pub key_to_addr: HashMap<frost_secp256k1_tr::Identifier, String>,
```
note that we now have this information in two different places; it already exists in `Orchestrator.committee_cfg.members`
Yup. It would probably be best if we moved the status logic into the Orchestrator struct so it would always have access to that map, but I figured this was better for separating out this auxiliary functionality. The map is not going to grow large anyway.
```rust
.key_to_addr
.iter()
.map(|(key, addr)| {
    let status = r_lock.status.get(key).unwrap().clone();
```
maybe cleaner:

```rust
let matcher = |s: &MemberStatus| {
    matches!(s, MemberStatus::Online)
        || matches!(s, MemberStatus::Disconnected((r, _)) if *r < unixtime())
};
r_lock.status.values().filter(|s| matcher(s)).collect_vec()
```
We can filter first using the cleaner `matches!` macro, but we would need the map for the next part of the code.
```rust
let member_status = state_w.status.get_mut(key).unwrap();
let last_retries = match old_status {
    MemberStatus::Disconnected((_, last_retry_number)) => *last_retry_number,
    _ => 0,
```
note that it is possible that the node was marked as offline in the meantime (due to a failure to respond in round 2 of a signature), so here we might want to have:

```rust
match old_status {
    MemberStatus::Disconnected((_, last_retry_number)) => *last_retry_number,
    MemberStatus::Online => 0,
    MemberStatus::Offline => continue,
}
```
Hi! This is a pretty interesting project. I read through the code and decided to help implement some stuff :)
Issues

- `generate-committee` fails if the provided directory doesn't exist; it now creates it by default
- Low-level request logs were moved from `info` to `debug`. Imho, info level logs shouldn't be spammed by low level request messages

Implemented Features

- Added a `status` method to the orchestrator jsonrpc which returns the alive/dead MPC nodes (using the key identifier, not the address, which could be private to the orchestrator)
- The `circom` and `snarkjs` binaries are required to do stuff, so I changed the README to be more clear :)

Let's get into the exciting details!
MemberStatus

A new `MemberStatus` enum was implemented which is used to classify the state of an MPC member inside the orchestrator:

- `Online` - The node is online and responsive
- `Disconnected` - The node is unresponsive, but there will be reconnection attempts with fibonacci delays
- `Offline` - The node is unresponsive and reconnection will not be attempted anymore. This can happen either because we tried reconnecting too many times in the `Disconnected` state, or because the node generated a round 1 signing commitment but didn't manage to complete round 2, meaning it is very unreliable and may be running a different version than the other nodes.
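For reference, a sketch of what this enum could look like (the `Disconnected` payload is inferred from the review snippets above, where it carries a retry timestamp and a retry counter; the exact types may differ):

```rust
pub enum MemberStatus {
    Online,
    // (unix timestamp of the next reconnection attempt, retries so far)
    Disconnected((u64, u64)),
    Offline,
}
```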
The status of each node will be held inside a `RwLock<MemberStatusState>` structure which is passed inside the `jsonrpsee` context.

status jsonrpc method

Pretty self-explanatory, this is what the status method returns:
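Something along these lines (a hypothetical shape; the real response keys members by their key identifier as described above):

```json
{
  "<member key identifier>": "Online",
  "<another key identifier>": "Disconnected",
  "<a third key identifier>": "Offline"
}
```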
Fibonacci backoff keepalive

When the orchestrator starts, it will ping the newly implemented `ping` endpoint of each MPC member and determine if it's alive or not. It will keep on querying the ping endpoint every `KEEPALIVE_WAIT_SECONDS` seconds (defined in `constants.rs`). It will only ping `Online` or `Disconnected` nodes and update the status accordingly. The fibonacci backoff means that if a node doesn't respond (is in the `Disconnected` state), it will try again in 5, 8, 13, 21, etc. seconds until it reaches `KEEPALIVE_MAX_RETRIES` retries, at which point it will classify the node as `Offline`.
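A small illustrative helper for that schedule (not the PR's actual code), seeded so the first retry waits 5 seconds as described:

```rust
// Keepalive delay in seconds for the given retry count: 5, 8, 13, 21, ...
// (Fibonacci, starting the sequence from the pair 3 and 5).
fn fibonacci_delay_secs(retry: u32) -> u64 {
    let (mut a, mut b) = (3u64, 5u64);
    for _ in 0..retry {
        (a, b) = (b, a + b);
    }
    b
}
```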
New concurrent fault-tolerant signing logic

This is how the new logic works and is pretty fault tolerant:

1. It takes the `Online` nodes and shuffles them. If there aren't enough of them, it returns a `not enough available signers` error.
2. It concurrently runs the signing rounds with the first `n` nodes required for the threshold.
3. If a node fails round 1, it is marked as `Disconnected` and jumps back to step 1. This way, it will try again with a new set of nodes.
4. If a node fails round 2, it is marked as `Offline` and jumps back to step 1.

Testing
I have tested every new feature, but I highly encourage you to test it out more!
In order to test the fault tolerance, I created some new committee keys and spun up my own orchestrator and MPC nodes with a new `ZKBITCOIN` address. These are the tests I have performed and the results:

Test 1
I tried "using" a zkapp while only 1/3 MPC nodes were online. I received the expected `not enough available signers` error.
Test 2

I spun up 2/3 nodes, but with a catch: one of the running nodes was rigged to panic when asked for round 2.

The orchestrator didn't crash and just returned the same error as in test 1. This is because it tried doing the signing logic, but failed on step 2 and classified the failing node as `Offline`. It then tried again, but only saw 1/3 available nodes and returned the error.

Test 3 - the most important test
I also did test 2 but with 3/3 nodes running, and only one of them rigged to crash when asked for a round 1 commitment. What happened is that it actually did randomly select the faulty node, but when it didn't receive a commitment from it, it classified that node as `Disconnected`, then tried again with the other 2 nodes, successfully generated a signature, and submitted the spend transaction. This is great fault tolerance!

Test 4
I spun up 2/3 nodes with no catches. This should work in this scenario, and it did.
Here is my `ZKBITCOIN` address: https://mempool.space/testnet/address/tb1przdyzca6zxlykmas4tvdum8qtac3m0ppsfm9p8akfckqkwnw07xs2u9cwf

There are 4 transactions: 2 deploys, one spent on test 3 and one spent on test 4.
If you have something to add or any suggestions on what I implemented, I would love to hear them!