tECDSA signing: Integration with the asynchronous state machine#3379
Here we integrate the fast state machine with the tECDSA signing protocol. All states defined in `pkg/tecdsa/signing` now conform to the fast state interface. Also, the fast state machine is now used as the state executor.
Retransmissions: We now use a global context in the state machine which is active until the timeout block. This context is used in all `channel.Send` calls that manage retransmissions under the hood. The context is used as a key in the map that holds the functions responsible for specific retransmissions. If we call `channel.Send` with the same context for different messages, the latest call will overwrite previous retransmissions with the current one. To avoid that, we use a separate child context for each message.
```go
	return ephemeralKeyPairStateDelayBlocks
}

func (ekpgs *ephemeralKeyPairGenerationState) Initiate(ctx context.Context) error {
	ekpgs.action.run(func() error {
```
Why do we need to wrap the call to generateEphemeralKeyPair() with ekpgs.action.run? The state machine should guarantee Initiate is called only one time.
This is not about the number of calls to Initiate. The reasoning is as follows:
- we need to wait until all ephemeral keys are generated and the state message is sent before we proceed to the next state
- currently, the state machine implementation guarantees that `Initiate` is called before `CanTransition` but this is an implementation detail that can change in the future so we cannot consider it as a strong guarantee. Because of that, we should check whether the state completed its computations and sent its message before proceeding in the state implementation as well. We don't want to check state internals (i.e. the number of generated keys) as this is not the right abstraction level for that. Additionally, that check does not guarantee that `channel.Send` was called
- `Initiate` should return as fast as possible as the message buffer can become filled meanwhile. Running an async action matches that requirement (a rough sketch of this pattern follows below)
- Almost all states of signing perform async computations so using this pattern for `ephemeralKeyPairGenerationState` is justified as we have consistency across all states
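To illustrate, here is a rough, hypothetical sketch of that pattern. It is loosely modelled on the `stateAction`/`run` names visible in the diff, but the implementation details are assumptions, not the actual keep-core code:

```go
// Hypothetical sketch: Initiate triggers the expensive work asynchronously via
// an action that runs at most once; the state later checks whether the action
// completed instead of inspecting its own internals.
package main

import (
	"fmt"
	"sync"
	"time"
)

// stateAction runs a function exactly once, in the background, and remembers
// whether it finished and with what error.
type stateAction struct {
	once sync.Once
	done chan struct{}
	err  error
}

func newStateAction() *stateAction {
	return &stateAction{done: make(chan struct{})}
}

func (sa *stateAction) run(fn func() error) {
	sa.once.Do(func() {
		go func() {
			sa.err = fn()
			close(sa.done)
		}()
	})
}

// completed reports whether the action finished without an error.
func (sa *stateAction) completed() bool {
	select {
	case <-sa.done:
		return sa.err == nil
	default:
		return false
	}
}

func main() {
	action := newStateAction()

	// Initiate would call run with key generation plus channel.Send inside;
	// it returns immediately so the message buffer can keep draining.
	action.run(func() error {
		time.Sleep(100 * time.Millisecond) // stands in for the real work
		return nil
	})

	fmt.Println(action.completed()) // false, work still in progress
	time.Sleep(200 * time.Millisecond)
	fmt.Println(action.completed()) // true, CanTransition could now succeed
}
```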
It seems to me that this is a problem with a state machine and not something the state implementation should worry about.
What do you think about:
- Making `AsyncState.Initiate` asynchronous in the state machine.
- Making `AsyncState.Initiate` return a channel when the computations are done.
- Probably getting rid of the `stateAction` structure.
I can try experimenting with this approach; if it works fine, we can open another PR that will be merged immediately after this one.
I think that will make the state machine much more complicated as we need to examine the state of computations triggered by `Initiate` in `CanTransition`. There is also a risk that we'll specialize the machine too much and it will not fit other possible use cases well enough. Of course, we can always drive some experiments and see.
Agreed, this will move more logic to the state machine but right now this logic is outside of the state machine and complicates state implementation. We do not necessarily need to modify the machine, we could also move some of this logic to the base async state. Anyway, something to experiment with.
Some loose thoughts here:
- `stateAction` ensures the given action can be triggered only one time. This is something the `Initiate` interface should be clear about and ensure. The state implementation should just trust this assumption.
- In the async state machine we gather all messages received, also those from the future. Blocking `Initiate` makes sense for the sync state machine, where we may want to initiate a state before we are ready to accept messages. For the async state machine, we want to accept messages all the time and blocking `Initiate` is not really needed.
- On the abstract level, the state machine should not be asking the state if it `CanTransition` if `Initiate` of the state has not been completed yet, no matter if `Initiate` is blocking or not. This is happening when `Initiate` is blocking, but requires some more logic from the state if `Initiate` logic is async.
```go
	// cannot be used directly because it is global for the entire
	// state machine. Because of that, channel.Send must be called with
	// a child context.
	sendCtx := context.WithValue(ctx, "message", message.Type())
```
Why do we use `WithValue`? Is it to trick `ticker.go` into distinguishing contexts and not treating contexts for two different states as equal?
We don't have a simple way to create a context copy in Go apart from wrapping it with `context.WithValue`. All other functions like `WithCancel` or `WithDeadline` produce a cancel function that must be managed somehow. We can't just ignore it as, in theory, we don't have a guarantee that the parent context is always canceled (that would cause context leaks). On the other hand, we cannot set up a proper call to the cancel function at this level. This is why the `WithValue` trick is a good option here, especially if we treat this solution as a temporary one until we fix the problem directly in `ticker.go`.
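To make the trade-off concrete, here is a minimal, self-contained sketch (names are assumptions, not the actual `ticker.go` code) showing why a context-keyed registry needs a distinct child context per message and why `context.WithValue` provides one without a cancel function to manage:

```go
// Hypothetical sketch: a registry keyed by context overwrites entries when the
// same context is reused, while wrapping the parent with context.WithValue
// yields a distinct child per message and leaves no cancel function behind.
package main

import (
	"context"
	"fmt"
)

// messageTypeKey is an illustrative context key type; the snippet under review
// uses a plain string key as a temporary measure.
type messageTypeKey struct{}

func main() {
	parent := context.Background()

	// registry stands in for the map in ticker.go that keeps retransmission
	// functions keyed by context.
	registry := make(map[context.Context]string)

	// Reusing the parent context directly: the second registration overwrites
	// the first one because the map key is identical.
	registry[parent] = "ephemeralPublicKeyMessage"
	registry[parent] = "tssRoundOneMessage"
	fmt.Println(len(registry)) // 1

	// Wrapping with context.WithValue produces a distinct child context per
	// message, so nothing is overwritten and nothing has to be canceled here.
	ctx1 := context.WithValue(parent, messageTypeKey{}, "ephemeralPublicKeyMessage")
	ctx2 := context.WithValue(parent, messageTypeKey{}, "tssRoundOneMessage")
	registry[ctx1] = "ephemeralPublicKeyMessage"
	registry[ctx2] = "tssRoundOneMessage"
	fmt.Println(len(registry)) // 3
}
```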
```
# Conflicts:
#	pkg/tecdsa/signing/signing.go
#	pkg/tecdsa/signing/states.go
```
```go
// signingAttemptMaxBlockDuration determines the maximum block duration of a
// single signing attempt.
const signingAttemptMaxBlockDuration = 100
```
Just flagging that this value requires more thinking. 100 blocks assuming 12sec for a single block is 20 minutes.
With 2 malicious members in a signing group, we need 5 retries of the protocol in the worst case (P = (98 choose 51) / (100 choose 51) = 0.23757575757), which takes 1h 40min.
With 3 malicious members in a signing group, we need 10 retries of the protocol in the worst case (P = (97 choose 51) / (100 choose 51) = 0.11393939393), which takes 3h 20 min.
Say we want to sweep no more than 10 deposits at once; this will take almost 17 hours in the 2-malicious-members case and twice that in the 3-malicious-members case. Not blocking this PR, rather something for threshold-network/tbtc-v2#414
...plus the blocks for announcement and delay!
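For reference, the probabilities quoted above can be reproduced with a small snippet; the helper name is made up for illustration:

```go
// Hypothetical helper: the chance that a randomly selected signing subset of
// t = 51 out of n = 100 members contains none of the m malicious members is
// P = C(n-m, t) / C(n, t).
package main

import (
	"fmt"
	"math/big"
)

func honestSubsetProbability(n, t, m int64) float64 {
	numerator := new(big.Int).Binomial(n-m, t)
	denominator := new(big.Int).Binomial(n, t)
	p, _ := new(big.Float).Quo(
		new(big.Float).SetInt(numerator),
		new(big.Float).SetInt(denominator),
	).Float64()
	return p
}

func main() {
	fmt.Println(honestSubsetProbability(100, 51, 2)) // ~0.2376
	fmt.Println(honestSubsetProbability(100, 51, 3)) // ~0.1139
}
```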
The state machine execution loop should exit once the machine context is done. Otherwise, the loop will run forever, causing a resource leak.
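A minimal, hypothetical sketch of the point above (illustrative names, not the actual keep-core state machine): the execution loop selects on `ctx.Done()` so it terminates once the machine context is done instead of looping forever:

```go
// Hypothetical sketch of an execution loop that cannot leak a goroutine when
// the protocol attempt times out.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type asyncState interface {
	Initiate(ctx context.Context) error
	CanTransition() bool
	Next() (asyncState, error)
}

// execute drives states until a final state is reached or ctx is done.
func execute(ctx context.Context, initial asyncState) (asyncState, error) {
	current := initial
	for {
		if err := current.Initiate(ctx); err != nil {
			return nil, err
		}
		for !current.CanTransition() {
			select {
			case <-ctx.Done():
				// The machine context is done; exit instead of spinning forever.
				return nil, ctx.Err()
			case <-time.After(50 * time.Millisecond):
				// Re-check; the real machine reacts to received messages.
			}
		}
		next, err := current.Next()
		if err != nil {
			return nil, err
		}
		if next == nil {
			return current, nil // final state reached
		}
		current = next
	}
}

// stuckState never becomes ready to transition; only the context stops it.
type stuckState struct{}

func (s *stuckState) Initiate(ctx context.Context) error { return nil }
func (s *stuckState) CanTransition() bool                { return false }
func (s *stuckState) Next() (asyncState, error)          { return nil, errors.New("unreachable") }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()

	_, err := execute(ctx, &stuckState{})
	fmt.Println(err) // context deadline exceeded
}
```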
The code here so far looks good but I haven't tested it yet. Waiting for two more changes planned by @lukasz-zimnoch:
The comments left are for future PRs; I would like to do some experimentation with the state machine based on the comments.
The stop pill should be broadcasted after a delay to give other participants some time to complete the result computation.
```go
	}

	return result
}
```
Both receivedMessages and receivedMessages could be abstracted somewhere. I am not sure if the state package is the best place for it but we will have to repeat the same code over and over again. Not blocking this PR though, just flagging.
Yeah, I was thinking about it as well and had the same feelings.
```diff
-func (ekpgs *ephemeralKeyPairGenerationState) Next() (state.SyncState, error) {
+func (ekpgs *ephemeralKeyPairGenerationState) CanTransition() bool {
+	messagingDone := len(receivedMessages[*ephemeralPublicKeyMessage](ekpgs)) ==
+		len(ekpgs.member.group.OperatingMemberIDs())-1
```
Would be great to abstract it out and calculate once when instantiating the initial state. Non-blocking, something for the future.
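As an illustration of what such an abstraction could look like, here is a self-contained, hypothetical sketch; the types and helper names are assumptions, not the actual keep-core API:

```go
// Hypothetical sketch: a generic helper that picks messages of a concrete type
// out of a state's inbox, plus a "did everyone deliver" check computed from
// the group size.
package main

import "fmt"

type message interface{ Type() string }

type ephemeralPublicKeyMessage struct{ senderID int }

func (m *ephemeralPublicKeyMessage) Type() string { return "ephemeralPublicKeyMessage" }

type memberState struct {
	inbox              []message
	operatingMemberIDs []int
}

// receivedMessages returns all messages of type T received so far.
func receivedMessages[T message](s *memberState) []T {
	var result []T
	for _, msg := range s.inbox {
		if typed, ok := msg.(T); ok {
			result = append(result, typed)
		}
	}
	return result
}

// messagingDone reports whether every other operating member delivered a
// message of type T.
func messagingDone[T message](s *memberState) bool {
	return len(receivedMessages[T](s)) == len(s.operatingMemberIDs)-1
}

func main() {
	s := &memberState{
		inbox: []message{
			&ephemeralPublicKeyMessage{senderID: 2},
			&ephemeralPublicKeyMessage{senderID: 3},
		},
		operatingMemberIDs: []int{1, 2, 3},
	}
	fmt.Println(messagingDone[*ephemeralPublicKeyMessage](s)) // true
}
```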
```go
	// be more sophisticated but since it is temporary, we
	// can live with it for now.
	go func() {
		time.Sleep(1 * time.Minute)
```
This time has to be shorter than the delay between attempts, right?
This is not a hard requirement. It will work even if this time is longer.
But then operators who were excluded from the current attempt may kick off the execution of the next attempt, correct?
Yeah, but once the stop pill arrives they will break. Even if the pill does not arrive, they will just fail eventually. Anyway, the stop pill mechanism will need to be reworked or replaced eventually once we have integration with the BTC network.
Tested the following scenarios using a local environment with 3 clients:
- Scenario 1: All clients running, signature requested, and produced without errors after 12 blocks.
- Scenario 2: All clients running, signature requested but one client goes down during execution. Other clients wait until the 100-block timeout gets hit and start the second attempt, which completes with success after 15 blocks.
- Scenario 3: All clients running, signature requested but a resource (CPU) shortage occurs during the processing (DKG round two happens at the same time). Execution slows down but finally the signature is produced after 20 blocks.
- Scenario 4: All clients running, signature requested but some members start with a delay (temporary hack in the code). Signature produced after 18 blocks.
Scenarios tested:
- Scenario 1: All clients running, signature requested, and produced successfully.
- Scenario 2: All clients running, signature requested but one client goes down during execution. Other clients wait until the 100-block timeout gets hit and start the second attempt, which completes successfully.

There are some suggestions left in this PR but I'll try to address them separately. This PR is a huge improvement and the suggestions require some experimentation.
Refs #3366
See #3379 (comment)
See 16a91a1

Previously we were identifying retransmission handlers by context. This did work well when each state for each member had a separate context in the synchronous state machine. It did not work well though when the same context was shared for all states in the async state machine. Given that the retransmission ticker identified retransmission functions by context, new retransmission requests were overwriting previous ones using the same context. We fixed it temporarily by creating new instances of child contexts but that was not a final solution we wanted to go with.

This commit fixes the problem in a proper way (TM). Each retransmission has a unique ID and no matter if the same or separate contexts are used, registering new retransmission requests does not overwrite existing ones.

Tested this PR by running a single signing that succeeded in ~3 minutes (with announcement phase included).
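For illustration, a hypothetical sketch of the ID-based approach described above; the names are made up and this is not the actual `ticker.go` implementation:

```go
// Hypothetical sketch: retransmission handlers keyed by a unique, increasing
// ID instead of by context, so registering a new handler never overwrites an
// existing one even when all states share the same context.
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

type retransmissionTicker struct {
	mu       sync.Mutex
	nextID   uint64
	handlers map[uint64]func()
}

func newRetransmissionTicker() *retransmissionTicker {
	return &retransmissionTicker{handlers: make(map[uint64]func())}
}

// register adds a retransmission handler and returns its unique ID. The
// context is kept only to stop retransmitting once it is done.
func (t *retransmissionTicker) register(ctx context.Context, handler func()) uint64 {
	id := atomic.AddUint64(&t.nextID, 1)
	t.mu.Lock()
	t.handlers[id] = handler
	t.mu.Unlock()
	go func() {
		<-ctx.Done()
		t.mu.Lock()
		delete(t.handlers, id)
		t.mu.Unlock()
	}()
	return id
}

// tick invokes all registered handlers; a real implementation would be driven
// by new blocks and could apply a backoff strategy.
func (t *retransmissionTicker) tick() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, handler := range t.handlers {
		handler()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ticker := newRetransmissionTicker()
	// Both registrations share the same context, yet neither overwrites the
	// other because each gets its own ID.
	ticker.register(ctx, func() { fmt.Println("retransmit: ephemeralPublicKeyMessage") })
	ticker.register(ctx, func() { fmt.Println("retransmit: tssRoundOneMessage") })
	ticker.tick()
}
```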
Refs #3366
See #3379 (comment)

In the async state machine we gather all messages received, also those from the "future" states. Blocking `Initiate` makes sense for the sync state machine, where we may want to initiate a state before we are ready to accept messages. For the async state machine, we want to accept messages all the time and blocking `Initiate` is not needed. Having `Initiate` run asynchronously will allow us to `Recv` messages to states and not overflow the buffer in the state machine in case the computations take longer than expected. It should also allow for faster execution of states because messages are received also at the time of initiating the state, so we can potentially transition right after the computations are done.

The state machine should not be asking the state if it `CanTransition` if `Initiate` of the state has not been completed yet. This was guaranteed with a blocking `Initiate` and has been maintained for the non-blocking `Initiate`. I've made it clear in the documentation of the `AsyncState`. This change has been done in 428e013. The next commit, 4a8a8c5, is a proof of how we could simplify state code with that change. I recommend reviewing the two commits separately.
Here we integrate the tECDSA DKG protocol with the async state machine introduced by #3362. The presented changes are analogous to the recent changes introduced to the tECDSA signing protocol by #3379. Long story short, we no longer use a fixed block duration for the protocol's states but move to a global timeout for the entire protocol attempt and switch states as soon as possible. As a result, the time of a single DKG attempt can be reduced as the protocol is no longer fixed and can adjust to the slowest participant.

The aforementioned change in the DKG protocol also involves a similar change in the DKG result publication step. The publication step is now based on the asynchronous state machine as well. This has two major consequences:
- Firstly, so far, the participants were waiting for at least a group quorum of DKG result signatures for a fixed block duration and then moved to the result submission state. Now, all members wait for the actual group size of DKG result signatures and then move to the result submission. This way, we are introducing an additional health check of group participants at the result publication step, thus we maximize the final number of operating group members. This is based on the assumption that if DKG went to the publication step, all participants had to complete the computation step successfully, so they should still be operating for publication.
- Secondly, the result submission queue is now time-based. Since the publication step relies on the async state machine, we can no longer use the block counter to set the submission eligibility queue. Each member preserves a time delay, according to its member index, before attempting to submit the DKG result.

Last but not least, we were forced to improve the message retransmission mechanism. As the DKG's TSS round two is computationally expensive, it takes ~100 blocks to complete its computations. The current retransmission strategy retransmits every message on every new block. In TSS round two, each member already retransmits messages from the ephemeral key generation state and from TSS round one. If the TSS round two state takes ~100 blocks, the amount of messages flying around the network is really huge and often caused network congestion. Also, the TSS round two state always takes almost all available CPU, so the situation became even worse. This resulted in a poor DKG success rate. To improve the situation and lower the number of messages passed through the network, we introduced a retransmission backoff strategy (#3396). All messages are now retransmitted at a decreasing rate so older messages are retransmitted less and less frequently. This allowed us to achieve a good success rate of the DKG protocol.

Changes presented in this PR were tested on a local environment with three clients. We performed five DKGs and all were completed with success after ~18 minutes. That means we achieved an improvement as the DKG based on the sync machine always took ~26 minutes (fixed duration of 125 computation blocks and 5 publication blocks with an average block time of 12 seconds).
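As an illustration only, here is one possible shape of such a backoff schedule; the actual strategy was introduced in #3396 and may differ from this sketch:

```go
// Purely illustrative backoff: a message is retransmitted on every block at
// first, then at block intervals that double after each cycle, so older
// messages are retransmitted less and less frequently.
package main

import "fmt"

// shouldRetransmit reports whether a message first sent at firstBlock should
// be retransmitted at currentBlock. The interval between retransmissions
// doubles after each "cycle" of the given length.
func shouldRetransmit(firstBlock, currentBlock, cycleBlocks uint64) bool {
	age := currentBlock - firstBlock
	if age == 0 {
		return false // just sent, nothing to retransmit yet
	}
	interval := uint64(1)
	for threshold := cycleBlocks; age > threshold; threshold += cycleBlocks {
		interval *= 2
	}
	return age%interval == 0
}

func main() {
	for block := uint64(1); block <= 30; block++ {
		if shouldRetransmit(0, block, 10) {
			fmt.Printf("block %d: retransmit\n", block)
		}
	}
}
```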
Refs: #3366
Here we integrate the tECDSA signing protocol with the asynchronous state machine introduced by #3362. Changes presented in this pull request involve changes in the `pkg/tecdsa/signing` states implementations in order to conform to the asynchronous state interface. As a result, the async state machine becomes the default runner used within the `signing.Execute` function. Thanks to the changes presented in this PR, the time of a single signing attempt can be greatly reduced as the protocol is no longer fixed and can adjust to the slowest participant.