[Research]: Signer Coordinator Selection and Failover #39

8marz8 · 2024-04-05T20:31:04Z

Completing the issue description and arriving at a conclusion is the deliverable of this issue.

Research - Signer Coordinator Selection and Failover

This ticket holds the research relating to Signer Coordinator Selection and Failover and how it impacts sBTC V1.

1. Summary

Following issue #37, we concluded that coordinator selection is best done with a VRF executed within the signers. It seems that we can mostly reuse the current coordinator code in stacks_signer, with one difference: use the Bitcoin ConsensusHash that the user's request is tied to, instead of the signer's own view of the Bitcoin ConsensusHash at the burnchain tip.

2. Context & Relevance

A coordinator is needed in signers in both the Nakamoto and sBTC releases: introducing signers as pivotal entities that process and approve certain operations requires synchronization and collective consensus.
In the context of the Nakamoto release, the signers run a DKG round to collectively agree on signing or rejecting blocks (using the DKG aggregate key). A coordinator is needed to trigger and finalize these commands on behalf of all the signers.
In the context of sBTC, signers act collectively to govern the peg mechanism. Hence, they need to collectively agree on honoring or rejecting each peg request.

In Nakamoto, there were some iterations on how to calculate the coordinator:

Pick the 0th signer as the coordinator.
Introduce a VRF selection to randomly but deterministically choose a signer: call the /v2/info RPC endpoint to fetch the ConsensusHash for the latest Stacks tip, hash it with the signer public keys, and pick the signer ID with the smallest resulting hash.
The above approach created issues when more end-to-end testing was done as the fast spinning of signers and chain boot-up would cause different signers to have a different view of the Stacks chain and thus calculate a different coordinator. This would halt the tests (and possibly real production) as DKG would never get calculated. Burnchain ConsensusHash was used as an alternative to pick an infrequent value but this also didn't resolve the issue.
Coordinator selection proved more complicated. Some ideas around BFT (Honey Badger) were thrown around, but that would be a lot of risky effort. As a resolution, the Nakamoto team revisited an old, forgotten suggestion of using miners as coordinators for signing. However, signers would still need a coordinator for the DKG round, since DKG is a prerequisite to block signing, so for that a constant ConsensusHash is being used to calculate a list of coordinator IDs.
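
The hash-and-sort selection from the second iteration above can be sketched roughly as follows. This is only an illustration, not the stacks_signer code: the function name, the byte layout of the hash input, and the choice of SHA-256 are assumptions. Note also that a plain hash ordering like this is a deterministic pseudo-random selection rather than a VRF with proofs.

```python
import hashlib


def coordinator_order(consensus_hash: bytes, signer_keys: dict[int, bytes]) -> list[int]:
    """Return signer IDs ordered by SHA-256(public_key || consensus_hash).

    The first ID is the coordinator; the rest form a fallback order.
    Hypothetical sketch: the input layout and hash function are assumptions.
    """
    def digest(item: tuple[int, bytes]) -> bytes:
        _signer_id, public_key = item
        return hashlib.sha256(public_key + consensus_hash).digest()

    return [signer_id for signer_id, _ in sorted(signer_keys.items(), key=digest)]


# Placeholder byte strings, not real public keys.
keys = {0: b"\x02" + b"\x11" * 32, 1: b"\x02" + b"\x22" * 32, 2: b"\x03" + b"\x33" * 32}

# Signers that see the same ConsensusHash compute the same order ...
same_view = coordinator_order(b"\xaa" * 20, keys)
# ... while a signer with a different chain view may compute another one.
other_view = coordinator_order(b"\xbb" * 20, keys)
```

The selection is only as consistent as its input: every signer must feed in the same ConsensusHash, which is exactly what broke during fast chain boot-up.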

3. Research

Some options to consider are:

Can we use transaction ID for our VRF selection?
Can the same miner as coordinator approach be utilized in sBTC?
Leverage most of the signer coordinator code, but instead of relying on the signer's view of the burnchain ConsensusHash, we use the Burnchain ConsensusHash or another ID that the request is tied to.

3.1.1 Use transaction ID for our VRF
This idea was brought up to prevent the issue in which signers have different views of the chain. However, it would lead to a new coordinator selection for every transaction, which prevents transaction batching and can be costly.

3.1.2 Miner as Coordinator
I don't think this can be used in sBTC V1 by itself. In Nakamoto, the miner acts as coordinator and tries to get the signers to sign its block, but in sBTC V1 the peg mechanism and its processing will happen separately from, and before, block production. (I could be very wrong here.)

3.1.3 Calculate VRF using burnchain data tied to transactions
This is partly related to the 3.1.1 suggestion, but instead of calculating the VRF per transaction ID, we can use the Burnchain block info that the batched transactions are linked to as our VRF parameter. This way, the coordinator changes only when a new Burnchain block is produced, which is less frequent, so batching can still be done.
We won't run into the issue of signers having different views of the burnchain, because in this case they don't hit the RPC endpoint and rely on their own view of the burnchain; they get that info from the transaction itself.
Note: Here I am making a big assumption that the Bitcoin info can somehow come with the transaction info. For deposits, triggered either from the sBTC contract or the Deposit API, we have to confirm that the BTC transaction was made and has materialized on Bitcoin, so that info should be available at some point. For withdrawals, the latest discussions are leaning towards allowing withdrawals on the burnchain to avoid fork issues, so I assume the block info would be accessible in this case as well.
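
A minimal sketch of this option, assuming (as the note above does) that each request carries the hash of the Bitcoin block it is tied to. The `Request` type, the function names, and the SHA-256 based selection are hypothetical and only illustrate the idea of deriving the coordinator from the request's own block instead of the signer's tip.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    txid: str
    burn_block_hash: bytes  # assumption: this travels with the request


def coordinator_for(block_hash: bytes, signer_keys: dict[int, bytes]) -> int:
    """Pick the signer whose SHA-256(public_key || block_hash) is smallest."""
    return min(
        signer_keys,
        key=lambda sid: hashlib.sha256(signer_keys[sid] + block_hash).digest(),
    )


def group_by_coordinator(requests: list[Request],
                         signer_keys: dict[int, bytes]) -> dict[int, list[str]]:
    """Batch requests by the coordinator derived from their own burn block."""
    batches: dict[int, list[str]] = {}
    for request in requests:
        coordinator = coordinator_for(request.burn_block_hash, signer_keys)
        batches.setdefault(coordinator, []).append(request.txid)
    return batches


# Placeholder keys and block hashes, for illustration only.
keys = {0: b"\x02" + b"\x11" * 32, 1: b"\x02" + b"\x22" * 32, 2: b"\x03" + b"\x33" * 32}
reqs = [
    Request("deposit-1", b"\x01" * 32),
    Request("deposit-2", b"\x01" * 32),   # same block as deposit-1
    Request("withdraw-1", b"\x02" * 32),  # a later block
]
batches = group_by_coordinator(reqs, keys)
```

Requests from the same block always land in the same batch, regardless of which tip the evaluating signer currently sees.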

3.1 Proposed Research Conclusions

Proposing option 3 (section 3.1.3) for the reasons stated above. It relies heavily on the assumption that the Burnchain block data can be retrieved for both deposit and withdrawal transactions, ideally without extra overhead.

3.2 External Resources

Coordinator Election Algorithms

3.3 Areas of Ambiguity

Is the proposed solution still susceptible to the coordinator mismatch issues faced in Nakamoto dev?
If we go with allowing withdrawals from sBTC contract calls, will this solution still be applicable? i.e. Can we still tie a withdrawal request to a Burnchain block?
Is a failover strategy achievable in this way? What's the current behavior in Nakamoto?

Closing Checklist

The takeaway from this issue is clearly documented in the description of this ticket.
Everyone necessary has reviewed the resolution and agrees with the takeaways.
This ticket has or links all the information necessary to familiarize a contributor with the topic and how it was resolved.


hstove · 2024-04-08T17:59:39Z

I think we can just use the latest Bitcoin block hash / consensus hash as the VRF "key", just like in Nakamoto for DKG. We may need similar behavior for handling the case where signers have different views of the current consensus hash. I wrote this before remembering that we don't even use BTC block hash in Nakamoto DKG, due to the complexities of it.

I don't think it's a good idea to try and match deposit/withdrawal with block IDs, because we'll have scenarios where we'll be processing multiple deposits/withdrawals from different blocks.

For failover, my understanding is that we haven't implemented this in Nakamoto, partially because we moved to "miner as coordinator" for blocks. I don't believe we've implemented coordinator failover for DKG, but I'm also not sure.

8marz8 · 2024-04-09T13:39:47Z

I think we can just use the latest Bitcoin block hash / consensus hash as the VRF "key", just like in Nakamoto for DKG.

Sure, we can revisit that approach but my concern was that it was never fully resolved.

I assumed sequential batching but it makes sense if it's more realistic to batch/process requests from different blocks - especially after the withdrawal discussion since we moved from 1 block processing to 6 blocks.

As for failover behavior: we currently don't have it in Nakamoto for the DKG case, and it seems the backup is the same static coordinator selection if miner-as-coordinator can't be picked.
Just recording my thoughts here: even with the timeout and last_message_seen parameters, if a coordinator is unresponsive and exceeds the timeout, I suspect not all signers will register that at exactly the same time and agree on the next coordinator (the next ID in the pre-calculated coordinator list). So to me the safest route for making sure that all signers agree on the same coordinator (for the VRF selection and for failover) is not to rely on their isolated selection alone. We might need some sort of voting, event emitting, or agreement step, and only once we know all signers have the same view of the coordinator can the DKG, or whatever coordinator-related operation, be processed.
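
To make the concern concrete, here is a small sketch of a purely local, timeout-based failover. The function and its arithmetic are hypothetical (only the timeout and last_message_seen names come from the discussion above); it simply advances one slot in the pre-calculated list per elapsed timeout.

```python
def active_coordinator(coordinator_order: list[int], last_message_seen: float,
                       now: float, timeout: float) -> int:
    """Advance one slot in the pre-calculated list for every elapsed timeout."""
    missed = int((now - last_message_seen) // timeout)
    return coordinator_order[missed % len(coordinator_order)]


order = [2, 0, 1]  # pre-calculated coordinator list, primary first
timeout = 30.0     # seconds; value chosen only for illustration

# Both signers evaluate at the same wall-clock time, but signer B received the
# coordinator's last message two seconds later than signer A did.
signer_a = active_coordinator(order, last_message_seen=100.0, now=131.0, timeout=timeout)
signer_b = active_coordinator(order, last_message_seen=102.0, now=131.0, timeout=timeout)
```

Signer A has already failed over to the secondary while signer B still follows the primary, so isolated timeouts alone do not give a shared view of who is coordinating.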

hstove · 2024-04-11T15:00:48Z

This relates to the lock time needed for deposit UTXOs. If we switch coordinators every N blocks, but we have a locktime of N-M, all that's needed to fail is for a single signer to be offline for N blocks.

That's totally OK, I think it just means that we need to have coordinator selection / fallback faster than our lock time.

netrome · 2024-04-11T15:00:58Z

the signers run a DKG round to collectively agree on signing or rejecting blocks

Nit: This is a signing round, not a DKG round. I believe we should handle coordinator selection separately for those two cases.

netrome · 2024-04-11T15:02:44Z

The above approach created issues when more end-to-end testing was done as the fast spinning of signers and chain boot-up would cause different signers to have a different view of the Stacks chain and thus calculate a different coordinator. This would halt the tests (and possibly real production) as DKG would never get calculated. Burnchain ConsensusHash was used as an alternative to pick an infrequent value but this also didn't resolve the issue.

The "separate chain view" issue is a problem for DKG, but not for signing rounds. In signing rounds everyone agrees on which bitcoin block a request is in, and therefore has a consistent number to base the VRF on.

netrome · 2024-04-11T15:03:26Z

In sBTC v1, DKG is (probably) a manual process - so I am not too worried about that initially. Signing rounds need some automatic coordinator selection though.

netrome · 2024-04-11T15:04:48Z

Right yeah, that's essentially what Option 3 says

AshtonStephens · 2024-04-11T15:20:54Z

Possibly use Deposit API to select the coordinator. Possibly

It turns out this is incredibly complicated

netrome · 2024-04-11T18:15:58Z

Okay after our discussions in the meeting today I've been thinking more on this and I believe we could achieve a feasible and sensible solution for coordinator failover.

Recap

Coordinator selection is solved by the proposed VRF in 3. This gives us a unique coordinator $C(B)$ for any bitcoin block $B$. The problem we are considering is how to handle the case where this coordinator fails to fulfill its duties. We were discussing potential failover mechanisms. It is easy to define an ordering of the coordinators so that everyone knows who the primary, secondary, tertiary, etc. coordinator is. The challenge is getting everyone to agree, at a given point in time, on who is coordinating.

While this is something we can probably find a workable solution for, doing so would likely entail implementing a complex protocol like Raft or Paxos, or building on top of a system that already implements one, like ZooKeeper or etcd. Any of these drags a lot of complexity into the application. In addition, these systems are vulnerable to Byzantine actors (which might be acceptable for v1).

Another path is to side-step the issue by leveraging the fact that Bitcoin already provides a point of synchronization. At each block, every signer has a consistent view of the underlying blockchains. This is how we solve coordinator selection in the first place in 3.

Proposal

Let there be only a single coordinator per bitcoin block, but let the next coordinator be responsible for any missed deposit and withdrawal requests in previous blocks. This means that if a coordinator is offline or failing to fulfill its duties, we will have a delay in processing any requests that this coordinator should handle. If we accept this potential delay in our system, we get in return the ability for all signers to agree on a single coordinator at any point in time.
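
The carry-over part of the proposal can be sketched as below. The `PegRequest` type and `requests_to_coordinate` function are hypothetical names; the sketch only shows which requests the coordinator of a given block would be responsible for under this scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PegRequest:
    txid: str
    block_height: int  # bitcoin block in which the request was observed


def requests_to_coordinate(tip_height: int, requests: list[PegRequest],
                           fulfilled: set[str]) -> list[PegRequest]:
    """Requests the coordinator of block `tip_height` is responsible for.

    Includes requests from the tip itself plus anything earlier coordinators
    left unfulfilled, oldest first.
    """
    pending = [r for r in requests
               if r.block_height <= tip_height and r.txid not in fulfilled]
    return sorted(pending, key=lambda r: r.block_height)


requests = [PegRequest("d1", 100), PegRequest("d2", 101), PegRequest("d3", 102)]
fulfilled = {"d1"}  # the coordinator of block 100 did its job

# The coordinator of block 101 was offline, so the coordinator of block 102
# picks up d2 in addition to its own d3.
todo = requests_to_coordinate(102, requests, fulfilled)
```

The cost of a failed coordinator is therefore a delay of at least one bitcoin block for its requests, not a stuck request.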

Note that it is still possible for the signers to have a divergent view of the underlying bitcoin blockchain. Two signers may simultaneously believe that they should coordinate a deposit request etc. However, all signers would unambiguously know which coordinator to ignore and which coordinator to respond to. For example, say we have bitcoin blocks $B_A \leftarrow B_B$ and coordinators $C(B_A)$ and $C(B_B)$, where the first coordinator has not yet observed block $B_B$ and still has pending requests. Both of the coordinators will therefore try to coordinate a signing round for the same requests. This can result in two scenarios.

The coordinator $C(B_A)$ successfully coordinates a signing round before 30% of signers observe bitcoin block $B_B$. Then this coordinator will succeed and the transaction is broadcast. In this scenario, the coordinator $C(B_B)$ may still be able to successfully run a signing round¹, but it will be able to observe the transaction from the previous coordinator in the mempool and therefore should not broadcast its transaction².
More than 30% of signers have seen block $B_B$ when they get the signing request from Coordinator A. At this point, they reject the request, awaiting coordinator $C(B_B)$ to take over.

Footnotes

¹ One could argue that signers should reject requests for Coordinator B if they have already signed for Coordinator A, but this would be problematic in 50/50 splits since the system would deadlock. Instead, each signer should always comply with the current coordinator according to their chain view (and naturally update their chain view if they hear from a coordinator from a block they don't know of). This naturally creates an ordering where coordinators of higher blocks take precedence over older coordinators.
² Note that even if the coordinator broadcasts its transaction it should not be any problem. We'd have two conflicting but valid transactions and only one of them would be mined.
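
The signer-side rule from the first footnote, together with the second scenario above, could look roughly like this. It is a simplified sketch: the `Signer` class is hypothetical, updating the chain view is reduced to bumping a height, and the 70% signing threshold is an assumption inferred from the 30% figure used in the scenarios.

```python
from dataclasses import dataclass


@dataclass
class Signer:
    tip_height: int  # height of the highest bitcoin block this signer has seen

    def handle_signing_request(self, coordinator_block_height: int) -> bool:
        """Return True if the signer takes part in the round."""
        if coordinator_block_height > self.tip_height:
            # The coordinator knows a newer block: catch up, then comply.
            self.tip_height = coordinator_block_height
        # Comply only with the coordinator of the current tip.
        return coordinator_block_height == self.tip_height


def round_succeeds(signers: list[Signer], coordinator_block_height: int,
                   threshold_percent: int = 70) -> bool:
    """Assumed 70% threshold; integer math avoids float rounding issues."""
    votes = sum(s.handle_signing_request(coordinator_block_height) for s in signers)
    return votes * 100 >= threshold_percent * len(signers)


# Four of ten signers have already seen block B_B (height 101).
signers = [Signer(101) for _ in range(4)] + [Signer(100) for _ in range(6)]

stale_round = round_succeeds(signers, 100)  # C(B_A): only six signers comply
fresh_round = round_succeeds(signers, 101)  # C(B_B): everyone catches up
```

Because a signer never goes back to an older coordinator, a split between two coordinators resolves in favor of the higher block instead of deadlocking.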

hstove · 2024-04-11T18:38:40Z

Note that even if the coordinator broadcasts its transaction it should not be any problem. We'd have two conflicting but valid transactions and only one of them would be mined.

Yeah, and unless the transaction was explicitly signed to be an RBF (which is not the default), it would get rejected as soon as it's broadcast.

Overall this sounds doable! As long as things like signing rounds can happen without every signer participating, I think this would lead to eventual consistency.

djordon · 2024-04-11T18:47:46Z

Another path is to side-step the issue by leveraging the fact that Bitcoin already provides a point of synchronization. At each block, every signer has a consistent view of the underlying blockchains.

Could we also use the current miner as a point of synchronization? This would allow us to piggyback off of whatever code is used to know who the current miner is.

netrome · 2024-04-11T19:13:08Z

Could we also use the current miner as a point of synchronization? This would allow us to piggyback off of whatever code is used to know who the current miner is.

Since we don't want to break consensus for v1 we can't enforce any miner behavior, so unfortunately I don't think such a solution would be viable.

djordon · 2024-04-11T19:53:28Z

Since we don't want to break consensus for v1 we can't enforce any miner behavior, so unfortunately I don't think such a solution would be viable.

Oh, I meant we just use the miner's ID, public key, or whatever unique identifier as the point of synchronization, but we do not involve the miner in the scheme in any other way. My thinking was that the signers must have some way of identifying that a Stacks block was from the right miner. Maybe that identifier is their public key or something, and we can just use that as input into our VRF selection process.

This approach introduces some skew into the coordinator selection but it might be a little easier to implement.

netrome · 2024-04-12T11:55:02Z

Oh, I meant we just use the miner's ID, public key, or whatever unique identifier as the point of synchronization, but we do not involve the miner in the scheme in any other way. My thinking was that the signers must have some way of identifying that a Stacks block was from the right miner. Maybe that identifier is their public key or something, and we can just use that as input into our VRF selection process.

This approach introduces some skew into the coordinator selection but it might be a little easier to implement.

Right, that's technically correct. The miner is selected using a VRF on every bitcoin block that has at least one leader block commit from a miner candidate. Anyone observing bitcoin will be able to run the same VRF to determine the miner, and could use the public key of the miner to run yet another VRF to decide the coordinator.

However, comparing the two options - simply running a VRF on the bitcoin block hash is much less complex and would always guarantee that we have a selected coordinator per block - while in the other case we'd have to parse all leader block commits, run two VRFs and we'd risk not having a new coordinator selected if no one is mining.

djordon · 2024-04-12T13:04:09Z

Anyone observing bitcoin will be able to run the same VRF to determine the miner, and could use the public key of the miner to run yet another VRF to decide the coordinator.

Oh I didn't know that the signers were already observing bitcoin. With that in mind, using the miner's public key is more complicated than using the bitcoin block directly.

hstove · 2024-04-12T13:15:37Z

Oh I didn't know that the signers were already observing bitcoin.

Yeah! Maybe more specifically, Stacks nodes monitor Bitcoin, and signers are connected to a Stacks node, so this information is available.

netrome · 2024-04-12T13:20:53Z

Yeah! Maybe more specifically, Stacks nodes monitor Bitcoin, and signers are connected to a Stacks node, so this information is available.

I believe the Signers will have to monitor Bitcoin directly themselves as well (or connect to a Bitcoin node or explorer). They need to:

Create and broadcast bitcoin transactions.
Validate deposit requests.
Monitor the status of transactions and RBF them if they aren't getting mined.

netrome · 2024-04-12T15:16:01Z

We've agreed on the proposal of using a VRF based on Bitcoin block ID. Every bitcoin block will have a designated coordinator. If a coordinator does not process any requests, the next coordinator takes over. If two coordinators are requesting signing rounds for the same requests, the coordinator with higher block ID takes precedence.

8marz8 added the research consolidating information. label Apr 5, 2024

8marz8 self-assigned this Apr 5, 2024

AshtonStephens added this to the High Level Design milestone Apr 8, 2024

netrome mentioned this issue Apr 10, 2024

[Design]: Bootstrap signer components #44

Closed

3 tasks

AshtonStephens modified the milestones: High Level Design, Low Level Design Apr 11, 2024

netrome closed this as completed Apr 12, 2024

AshtonStephens added sbtc signer binary The sBTC Bootstrap Signer. signer coordination The actions executed by the signer coordinator. signer communication Communication across sBTC bootstrap signers. signer state model The sBTC bootstrap signer state model. labels Apr 14, 2024

AshtonStephens added this to sBTC May 22, 2024

github-project-automation bot moved this to Needs Triage in sBTC May 22, 2024

AshtonStephens moved this from Needs Triage to Done in sBTC May 22, 2024

djordon mentioned this issue Sep 2, 2024

feat: add handler for smart contract events #474

Merged

4 tasks
