
Fast state machine #3362

Merged
lukasz-zimnoch merged 8 commits into main from faststate on Oct 19, 2022

Conversation

@pdyraga
Member

@pdyraga pdyraga commented Oct 17, 2022

Refs #3366

Package faststate contains a generic state machine implementation that is meant to be used with interactive protocols which do not require a strict synchronization mechanism between protocol members. The synchronization is based on a signal from each participant that it is ready to proceed to the next step, for example, when it has received all the necessary information from all the other participants.

This approach allows for faster execution of protocols but has strict requirements regarding the implementation of states.
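To make the idea more concrete, here is a minimal, hypothetical sketch of a state driven by such a readiness signal, together with a machine loop that advances as soon as the signal fires. The interface and method names below (`asyncState`, `Initiate`, `Receive`, `CanTransition`, `Next`) are illustrative assumptions, not the actual `faststate` API:

```go
package faststatesketch

import "context"

// asyncState is a hypothetical shape of a state that decides on its own when
// the machine may move on, instead of waiting for a fixed number of blocks.
type asyncState interface {
	// Initiate starts the state's work, e.g. broadcasting this state's message.
	Initiate(ctx context.Context) error
	// Receive consumes a message from another participant.
	Receive(payload []byte) error
	// CanTransition signals that all necessary information has been received
	// and the machine may proceed to the next state immediately.
	CanTransition() bool
	// Next returns the following state, or nil if this is the final state.
	Next() asyncState
}

// run advances through the states as soon as each of them signals readiness.
func run(ctx context.Context, state asyncState, inbox <-chan []byte) error {
	for state != nil {
		if err := state.Initiate(ctx); err != nil {
			return err
		}
		for !state.CanTransition() {
			select {
			case payload := <-inbox:
				if err := state.Receive(payload); err != nil {
					return err
				}
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		state = state.Next()
	}
	return nil
}
```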

Requirement 1: Context lifetime and retransmissions

The context passed to faststate.NewMachine must remain active until the result is published to the chain or until a fixed time allotted for the protocol execution has passed.

The context is used for the retransmission of messages and all protocol participants must have a chance to receive messages from other participants. Consider the following example: there are two participants of the protocol, A and B, and they are executing the final state of the protocol. If the context were canceled right after a participant completes its work, without a confirmation that the result was published on-chain, we could run into a situation where A receives a message from B and exits immediately, without giving B a chance to receive a message:

A: |-S------R-|
B:        |-S----------(...)

|- denotes when the last state starts and -| denotes when it ends. S denotes when the given participant sent its message and R denotes when the given participant received the message from the other. Since B initiated the last state later than A (it could have been executing some time-consuming computations), A has already sent its message and now B needs to wait for the retransmissions. B sends its message immediately upon initiating the state and this message is received by A. Since A exits and cancels the context immediately after receiving a message from B, it no longer retransmits its message and B hangs forever.

There are two solutions. One is that both A and B observe the chain and keep the context active until the result is published. The other is time-based: each member waits for some time before completing the protocol and canceling the retransmission context, or the entire protocol has a fixed maximum execution time after which it times out.
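Below is a minimal sketch of the time-based solution, keeping the context alive for a fixed maximum execution time; `protocolTimeout`, `runStateMachine`, and `waitForPublication` are hypothetical placeholders, not part of the actual package:

```go
package contextlifetime

import (
	"context"
	"time"
)

// protocolTimeout is a hypothetical fixed maximum protocol execution time.
const protocolTimeout = 10 * time.Minute

// runStateMachine and waitForPublication are hypothetical stand-ins for
// executing the faststate machine and observing the on-chain publication.
func runStateMachine(ctx context.Context) (string, error)         { return "result", nil }
func waitForPublication(ctx context.Context, result string) error { return nil }

// executeProtocol keeps the retransmission context alive until the result is
// published on-chain or the fixed execution time elapses, whichever is first.
func executeProtocol(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, protocolTimeout)
	defer cancel()

	result, err := runStateMachine(ctx)
	if err != nil {
		return err
	}

	// Do not cancel right after finishing own work; other participants may
	// still depend on retransmissions of this participant's messages.
	return waitForPublication(ctx, result)
}
```

With either variant, the important property is that the context outlives this participant's own last state, so retransmissions keep flowing until everybody had a chance to receive them.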

Requirement 2: Store all received messages

Since the state machine does not require strict synchronization between participants, there is no guarantee about which point of the execution the rest of the group has reached. If the current member is at the first state of the execution and all other members have advanced to further states, the current member should accept and store messages from further states "for the future" instead of rejecting them. This can be achieved using the faststate.BaseState structure.
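As an illustration of this requirement, here is a hypothetical message history buffer that accepts every message, including messages belonging to states the local member has not reached yet; the types and names are made up for the example and do not reflect the actual faststate.BaseState implementation:

```go
package messagehistory

import "sync"

// protocolMessage is a hypothetical stand-in for net.TaggedMarshaller; only
// the message type tag matters for this sketch.
type protocolMessage interface {
	Type() string
}

// history stores every received message, including messages that belong to
// states the local member has not reached yet.
type history struct {
	mutex    sync.Mutex
	messages []protocolMessage
}

// receive appends the message unconditionally; messages "from the future" are
// kept instead of being rejected.
func (h *history) receive(msg protocolMessage) {
	h.mutex.Lock()
	defer h.mutex.Unlock()
	h.messages = append(h.messages, msg)
}

// getAll returns messages of the given type so a later state can pick up what
// arrived while the member was still in an earlier state; it may return nil
// if no such message was received yet.
func (h *history) getAll(messageType string) []protocolMessage {
	h.mutex.Lock()
	defer h.mutex.Unlock()

	var result []protocolMessage
	for _, msg := range h.messages {
		if msg.Type() == messageType {
			result = append(result, msg)
		}
	}
	return result
}
```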

Despite having the Type() available in every net.TaggedMarshaller
message sent, we were hardcoding it to "local". This would not play well
with tests that send multiple types of messages via the local broadcast
channel. One such test will be added in one of the next commits.
Explained how the synchronization works and why we need
DelayBlocks().
@pdyraga pdyraga requested a review from a team October 17, 2022 11:22
@pdyraga pdyraga added this to the v2.0.0-m3 milestone Oct 17, 2022
@pdyraga pdyraga self-assigned this Oct 17, 2022
@@ -0,0 +1,129 @@
// Package faststate contains a generic state machine implementation that is
Member

I'm wondering about alternative ways of structuring the code of both state machines. Packages state and faststate can be really mysterious for someone not familiar with this code and the differences are not obvious at first sight.

What comes to my mind is the fact that we can call the state.Machine a "synchronous" machine, as states are synchronized using the block counter. On the other hand, the new faststate.Machine is "asynchronous", as states decide about the transition using their own internal predicates. That said, we can refactor the current structure in one of two ways:

Option 1: We can rename the package state to syncstate and faststate to asyncstate. This way we will end up with the following types:

  • syncstate.State
  • syncstate.Machine
  • asyncstate.State
  • asyncstate.Machine

Option 2: We can move everything into a single package state and rename specific types. This way we would have:

  • state.SyncState
  • state.SyncMachine
  • state.AsyncState
  • state.AsyncMachine

Although O1 can be tempting as we would have two separate and small packages with well-defined types, I think we should lean towards O2. The main reasons are:

  • All this code is about state and both machines (and associated types) have a lot in common, so it makes sense to keep them in a single state package. We can probably extract some common parts to reduce duplication (both State interfaces share a lot of methods).
  • This option is closer to how we typically group things into packages. We deal with some different flavors of the same component. For example, we have the pkg/tecdsa/retry package that exposes two retry algorithms, for two separate use cases.
  • The size of the resulting package will still be at a sane level.

Member Author

I also lean towards Option 2 but I would do it in a separate PR. I am afraid the diff will be significant and we could run into conflicts with other work.

Member

Sure, let's do it separately!

Member Author

Done in #3365.

Simplified `receive` function to just use append since it works with a
nil slice. Also, improved GetAllReceivedMessages documentation to make
it clear that a nil slice may be returned.
The continue was the last instruction in the case block so it was
redundant.
The name cancelCtx better explains what is canceled.
This function needs to be exported so that it can be used outside of the
faststate package. At the same time, we do not want to name it Receive,
so that state implementations are forced to implement Receive themselves
and wrap ReceiveToHistory with a set of validations for net.Message.
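A rough sketch of the pattern described in the last commit note above, with hypothetical types standing in for net.Message; the validation performed by Receive is made up for the example:

```go
package receivewrapper

import "errors"

// netMessage is a hypothetical stand-in for net.Message; only the parts used
// by the sketch are modeled.
type netMessage interface {
	SenderID() int
}

var errUnexpectedSender = errors.New("unexpected sender")

// BaseState sketch: ReceiveToHistory is exported so states built on top of it
// can call it from outside the package, but it is deliberately not named
// Receive, so every concrete state still has to provide its own Receive.
type BaseState struct {
	phaseMessages []netMessage
}

// ReceiveToHistory stores the message unconditionally. append works on a nil
// slice, so phaseMessages needs no explicit initialization.
func (bs *BaseState) ReceiveToHistory(msg netMessage) {
	bs.phaseMessages = append(bs.phaseMessages, msg)
}

// exampleState shows the intended pattern: the state's own Receive validates
// the net.Message first and only then delegates to ReceiveToHistory.
type exampleState struct {
	BaseState
	expectedSender int
}

func (s *exampleState) Receive(msg netMessage) error {
	if msg.SenderID() != s.expectedSender {
		return errUnexpectedSender
	}
	s.ReceiveToHistory(msg)
	return nil
}
```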
@lukasz-zimnoch lukasz-zimnoch merged commit 400168c into main Oct 19, 2022
@lukasz-zimnoch lukasz-zimnoch deleted the faststate branch October 19, 2022 14:25
@pdyraga pdyraga modified the milestones: v2.0.0-m3, v2.0.0-m2 Oct 24, 2022
lukasz-zimnoch added a commit that referenced this pull request Oct 25, 2022
Refs #3366
See this discussion:
#3362 (comment)

The goal of the changes is to merge together `faststate` and `state`
packages. Instead of two packages, we are going to have one `state`
package with two implementations of the state machine: asynchronous and
synchronous.
    
`state.SyncMachine` is meant to be used with interactive protocols when
participants are expected to synchronize based on the number of blocks
being mined while the protocol is executing. Even if the given
participant received all the necessary information to continue the
protocol, the state machine waits for a fixed number of blocks before
proceeding to the next step. This approach works best when the
protocol may finish successfully even if some members expected to
participate in the execution are inactive.

`state.AsyncMachine` is meant to be used with interactive protocols when
participants are expected to synchronize based on the messages being
sent and some participants may be slower than others. To finish the
execution, each protocol participant must wait for the slowest
participant. This approach works best when the protocol may
finish successfully only if all members expected to participate in the
execution are actively participating.
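To illustrate the difference in the transition rule only, a hedged sketch: the synchronous state exposes a block-based delay, while the asynchronous state replaces it with a readiness predicate like the one sketched in the PR description above. The method names besides DelayBlocks are illustrative and the real interfaces in the state package are richer than this:

```go
package statesketch

import "context"

// syncState sketch: the machine waits DelayBlocks() mined blocks in this
// state before transitioning, even if all messages already arrived.
type syncState interface {
	Execute(ctx context.Context) error
	DelayBlocks() uint64
}

// asyncState sketch: the machine transitions as soon as the state reports
// readiness, so the pace is set by the slowest participant rather than by a
// fixed block schedule.
type asyncState interface {
	Execute(ctx context.Context) error
	CanTransition() bool
}
```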
pdyraga added a commit that referenced this pull request Oct 26, 2022
Refs: #3366

Here we integrate the tECDSA signing protocol with the asynchronous
state machine introduced by
#3362. Changes presented
in this pull request involve changes to the `pkg/tecdsa/signing` state
implementations in order to conform to the asynchronous state interface.
As a result, the async state machine becomes the default runner used
within the `signing.Execute` function. Thanks to the changes presented in
this PR, the time of a single signing attempt can be greatly reduced as
the protocol duration is no longer fixed and can adjust to the slowest
participant.
pdyraga added a commit that referenced this pull request Nov 10, 2022
Here we integrate the tECDSA DKG protocol with the async state machine
introduced by #3362. The presented
changes are analogous to the recent changes introduced to the tECDSA signing
protocol by #3379. Long story short, we no longer use a fixed block duration for
the protocol's states but move to a global timeout for the entire protocol attempt
and switch states as soon as possible. As a result, the time of a single DKG
attempt can be reduced as the protocol duration is no longer fixed and can adjust
to the slowest participant.

The aforementioned change in the DKG protocol also involves a similar change in
the DKG result publication step. The publication step is now based on the
asynchronous state machine as well. This has two major consequences:
- Firstly, so far, the participants were waiting for at least a group quorum of
DKG result signatures for a fixed block duration and then moved to the result
submission state. Now, all members wait for the actual group size of DKG result
signatures and then move to the result submission. This way, we are introducing an
additional health check of group participants at the result publication step and
thus maximize the final number of operating group members. This is based on
the assumption that if DKG went to the publication step, all participants had to
complete the computation step successfully, so they should still be operating
for publication.
- Secondly, the result submission queue is now time-based. Since the publication
step relies on the async state machine, we can no longer use the block counter
to set the submission eligibility queue. Each member observes a time delay,
proportional to its member index, before attempting to submit the DKG result
(see the sketch after this list).
Last but not least, we were forced to improve the message retransmission
mechanism. As the DKG's TSS round two is computationally expensive, it takes
members ~100 blocks to complete their computations. The previous retransmission
strategy retransmitted every message on every new block. In TSS round two, each
member already retransmits messages from the ephemeral key generation state and
from TSS round one. If the TSS round two state takes ~100 blocks, the number of
messages flying around the network is huge and often caused network
congestion. Also, the TSS round two state always takes almost all available CPU,
so the situation became even worse. This resulted in a poor DKG success rate.
To improve the situation and lower the number of messages passed through the
network, we introduced a retransmission backoff strategy (#3396). All messages
are now retransmitted at a decreasing rate so older messages are retransmitted
less and less frequently. This allowed us to achieve a good success rate of the
DKG protocol.
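A simplified sketch of a decreasing-rate retransmission rule in the spirit described above; the real strategy from #3396 may differ in its details:

```go
package retransmission

// shouldRetransmit decides, for a given tick (e.g. a new block) since the
// message was first sent, whether the message should be retransmitted again.
// The rate decreases: only ticks that are powers of two trigger a
// retransmission, so older messages are retransmitted less and less often.
func shouldRetransmit(ticksSinceFirstSend int) bool {
	if ticksSinceFirstSend <= 0 {
		return false
	}
	// n & (n-1) == 0 holds exactly for powers of two.
	return ticksSinceFirstSend&(ticksSinceFirstSend-1) == 0
}
```

Under this rule a message is retransmitted on ticks 1, 2, 4, 8, 16, ..., so a message that is 100 blocks old has been retransmitted only a handful of times instead of 100 times.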

Changes presented in this PR were tested in a local environment with three
clients. We performed five DKGs and all completed successfully after
~18 minutes. That is an improvement, as DKG based on the sync
machine always took ~26 minutes (a fixed duration of 125 computation blocks and 5
publication blocks with an average block time of 12 seconds).
@pdyraga pdyraga removed their assignment Nov 10, 2022