Fast state machine #3362
Conversation
Despite having the Type() available in every net.TaggedMarshaller message sent, we were hardcoding it to "local". This would not play well with tests that send multiple types of messages via the local broadcast channel. One such test will be added in one of the next commits.
Explained how the synchronization works and why we need DelayBlocks().
Package faststate contains a generic state machine implementation that is meant to be used with interactive protocols which do not require a strict synchronization mechanism between protocol members. The synchronization is based on a signal from each participant that they are ready to proceed to the next step when, for example, they have received all the necessary information from all the other participants. This approach allows for faster execution of protocols but has strict requirements regarding the implementation of states.

Requirement 1: Context lifetime and retransmissions

The context passed to `faststate.NewMachine` must remain active as long as the result has not been published to the chain, or until a fixed time for the protocol execution has passed. The context is used for the retransmission of messages, and all protocol participants must have a chance to receive messages from other participants. Consider the following example: there are two participants of the protocol, A and B, and they are executing the final state of the protocol. If the context were canceled right after a participant completed its work, without confirmation that the result was published on-chain, we could run into a situation where A received a message from B and exited immediately, without giving B a chance to receive A's message:

A: |-S------R-|
B: |-S----------(...)

`|-` denotes when the last state starts and `-|` denotes when it ends. `S` denotes when the given participant sent its message and `R` denotes when the given participant received the message from the other. Since B initiated the last state later than A (it could have been executing some time-consuming computations), A has already sent its message and B needs to wait for the retransmissions. B sends its message immediately upon initiation and this message is received by A. Since A exits and cancels the context immediately after receiving the message from B, it no longer retransmits its own message and B hangs forever.

There are two solutions. One is that both A and B observe the chain and keep the context active until the result is published. The other is that each member waits for some time before completing the protocol and canceling the context used for retransmissions, or that the entire protocol has a fixed maximum execution time after which it times out.

Requirement 2: Store all received messages

Since the state machine does not require strict synchronization between participants, there is no guarantee of where in the execution the rest of the group is. If the current member is at the first state of the execution while all other members have advanced to further states, the current member should accept and store messages from further states "for the future" instead of rejecting them. This can be achieved using the `faststate.BaseState` structure.
@@ -0,0 +1,129 @@
// Package faststate contains a generic state machine implementation that is
I'm wondering about alternative ways of structuring the code of both state machines. The state and faststate packages can be really mysterious for someone not familiar with this code, and the differences are not obvious at first sight.
What comes to my mind is the fact that we can call state.Machine a "synchronous" machine, as states are synchronized using the block counter. On the other hand, the new faststate.Machine is "asynchronous", as states decide about the transition using their own internal predicates. That said, we can refactor the current structure in one of two ways:
Option 1: We can rename the package state to syncstate and faststate to asyncstate. This way we will end up with the following types:
- `syncstate.State`
- `syncstate.Machine`
- `asyncstate.State`
- `asyncstate.Machine`
Option 2: We can move everything into a single package state and rename specific types. This way we would have:
- `state.SyncState`
- `state.SyncMachine`
- `state.AsyncState`
- `state.AsyncMachine`
Although Option 1 can be tempting, as we would have two separate and small packages with well-defined types, I think we should lean towards Option 2. The main reasons are:
- All this code is about state, and both machines (and associated types) have a lot in common, so it makes sense to keep them in a single `state` package. We can probably extract some common parts to reduce duplication (both `State` interfaces share a lot of methods).
- This option is closer to how we typically group things into packages. We deal with different flavors of the same component. For example, we have the `pkg/tecdsa/retry` package that exposes two retry algorithms, for two separate use cases.
- The size of the resulting package will still be at a sane level.
I also lean towards Option 2 but I would do it in a separate PR. I am afraid the diff will be significant and we could run into conflicts with other work.
Sure, let's do it separately!
Simplified the `receive` function to just use `append`, since it works with a nil slice. Also improved the `GetAllReceivedMessages` documentation to make it clear that a nil slice may be returned.
The `continue` was the last instruction in the `case` block, so it was redundant.
The name `cancelCtx` better explains what is being canceled.
This function needs to be exported so that it can be used outside of the faststate package. At the same time, we deliberately do not name it `Receive`: that forces state implementations to provide their own `Receive` that wraps `ReceiveToHistory` with a set of validations for `net.Message`.
Refs #3366. See this discussion: #3362 (comment)

The goal of the changes is to merge the `faststate` and `state` packages. Instead of two packages, we are going to have one `state` package with two implementations of the state machine: asynchronous and synchronous.

`state.SyncMachine` is meant to be used with interactive protocols whose participants are expected to synchronize based on the number of blocks mined while the protocol is executing. Even if the given participant has received all the information necessary to continue the protocol, the state machine waits for a fixed number of blocks before proceeding to the next step. This approach is optimal when the protocol may finish successfully even if some members expected to participate in the execution are inactive.

`state.AsyncMachine` is meant to be used with interactive protocols whose participants are expected to synchronize based on the messages being sent, and where some participants may be slower than others. To finish the execution, each protocol participant must wait for the slowest participant. This approach is optimal when the protocol may finish successfully only if all members expected to participate in the execution are actively participating.
Refs: #3366

Here we integrate the tECDSA signing protocol with the asynchronous state machine introduced by #3362. The changes presented in this pull request adjust the `pkg/tecdsa/signing` state implementations to conform to the asynchronous state interface. As a result, the async state machine becomes the default runner used within the `signing.Execute` function. Thanks to these changes, the time of a single signing attempt can be greatly reduced, as the protocol's duration is no longer fixed and can adjust to the slowest participant.
Here we integrate the tECDSA DKG protocol with the async state machine introduced by #3362. The presented changes are analogous to the recent changes introduced to the tECDSA signing protocol by #3379. Long story short, we no longer use a fixed block duration for the protocol's states but move to a global timeout for the entire protocol attempt and switch states as soon as possible. As a result, the time of a single DKG attempt can be reduced, as the protocol's duration is no longer fixed and can adjust to the slowest participant.

The aforementioned change in the DKG protocol also involves a similar change in the DKG result publication step. The publication step is now based on the asynchronous state machine as well. This has two major consequences:
- Firstly, so far the participants were waiting for at least a group quorum of DKG result signatures for a fixed block duration and then moved to the result submission state. Now, all members wait for signatures from the actual group size and only then move to the result submission. This way, we introduce an additional health check of group participants at the result publication step and thus maximize the final number of operating group members. This is based on the assumption that if DKG reached the publication step, all participants had to complete the computation step successfully, so they should still be operating for publication.
- Secondly, the result submission queue is now time-based. Since the publication step relies on the async state machine, we can no longer use the block counter to set the submission eligibility queue. Each member observes a time delay, according to its member index, before attempting to submit the DKG result.

Last but not least, we were forced to improve the message retransmission mechanism. The DKG's TSS round two is computationally expensive, and it takes members ~100 blocks to complete their computations. The previous retransmission strategy retransmitted every message on every new block. In TSS round two, each member was already retransmitting messages from the ephemeral key generation state and from TSS round one. With the TSS round two state taking ~100 blocks, the number of messages flying around the network was really huge and often caused network congestion. The TSS round two state also consumes almost all available CPU, so the situation became even worse. This resulted in a poor DKG success rate.

To improve the situation and lower the number of messages passed through the network, we introduced a retransmission backoff strategy (#3396). All messages are now retransmitted at a decreasing rate, so older messages are retransmitted less and less frequently. This allowed us to achieve a good success rate for the DKG protocol.

Changes presented in this PR were tested in a local environment with three clients. We performed five DKGs and all completed successfully after ~18 minutes. That is an improvement, as DKG based on the sync machine always took ~26 minutes (a fixed duration of 125 computation blocks and 5 publication blocks, with an average block time of 12 seconds).
Refs #3366