
Fast state machine #3362

Merged
lukasz-zimnoch merged 8 commits into main from faststate on Oct 19, 2022

Conversation

@pdyraga
Member

@pdyraga pdyraga commented Oct 17, 2022

Refs #3366

Package faststate contains a generic state machine implementation that is meant to be used with interactive protocols which do not require a strict synchronization mechanism between protocol members. The synchronization is based on a signal from each participant that it is ready to proceed to the next step, for example, when it has received all the necessary information from all the other participants.

This approach allows for faster execution of protocols but has strict requirements regarding the implementation of states.
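To make the idea more concrete, here is a minimal, hypothetical sketch of a state driven by such a readiness signal, together with a machine loop that advances as soon as the signal fires. The interface and method names below (`asyncState`, `Initiate`, `Receive`, `CanTransition`, `Next`) are illustrative assumptions, not the actual `faststate` API:

```go
package faststatesketch

import "context"

// asyncState is a hypothetical shape of a state that decides on its own when
// the machine may move on, instead of waiting for a fixed number of blocks.
type asyncState interface {
	// Initiate starts the state's work, e.g. broadcasting this state's message.
	Initiate(ctx context.Context) error
	// Receive consumes a message from another participant.
	Receive(payload []byte) error
	// CanTransition signals that all necessary information has been received
	// and the machine may proceed to the next state immediately.
	CanTransition() bool
	// Next returns the following state, or nil if this is the final state.
	Next() asyncState
}

// run advances through the states as soon as each of them signals readiness.
func run(ctx context.Context, state asyncState, inbox <-chan []byte) error {
	for state != nil {
		if err := state.Initiate(ctx); err != nil {
			return err
		}
		for !state.CanTransition() {
			select {
			case payload := <-inbox:
				if err := state.Receive(payload); err != nil {
					return err
				}
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		state = state.Next()
	}
	return nil
}
```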

Requirement 1: Context lifetime and retransmissions

The context passed to faststate.NewMachine must remain active until the result is published to the chain or until a fixed time allotted for the protocol execution has passed.

The context is used for the retransmission of messages and all protocol participants must have a chance to receive messages from other participants. Consider the following example: there are two participants of the protocol, A and B, and they are executing the final state of the protocol. If the context were canceled right after a participant completes its work, without a confirmation that the result was published on-chain, we could run into a situation where A receives a message from B and exits immediately, without giving B a chance to receive a message:

A: |-S------R-|
B:        |-S----------(...)

|- denotes when the last state starts and -| denotes when it ends. S denotes when the given participant sent its message and R denotes when the given participant received the message from the other. Since B initiated the last state later than A (it could have been executing some time-consuming computations), A has already sent its message and now B needs to wait for the retransmissions. B sends its message immediately upon initiating the state and this message is received by A. Since A exits and cancels the context immediately after receiving a message from B, it no longer retransmits its message and B hangs forever.

There are two solutions. One is that both A and B observe the chain and keep the context active until the result is published. The other is time-based: each member waits for some time before completing the protocol and canceling the retransmission context, or the entire protocol has a fixed maximum execution time after which it times out.
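Below is a minimal sketch of the time-based solution, keeping the context alive for a fixed maximum execution time; `protocolTimeout`, `runStateMachine`, and `waitForPublication` are hypothetical placeholders, not part of the actual package:

```go
package contextlifetime

import (
	"context"
	"time"
)

// protocolTimeout is a hypothetical fixed maximum protocol execution time.
const protocolTimeout = 10 * time.Minute

// runStateMachine and waitForPublication are hypothetical stand-ins for
// executing the faststate machine and observing the on-chain publication.
func runStateMachine(ctx context.Context) (string, error)         { return "result", nil }
func waitForPublication(ctx context.Context, result string) error { return nil }

// executeProtocol keeps the retransmission context alive until the result is
// published on-chain or the fixed execution time elapses, whichever is first.
func executeProtocol(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, protocolTimeout)
	defer cancel()

	result, err := runStateMachine(ctx)
	if err != nil {
		return err
	}

	// Do not cancel right after finishing own work; other participants may
	// still depend on retransmissions of this participant's messages.
	return waitForPublication(ctx, result)
}
```

With either variant, the important property is that the context outlives this participant's own last state, so retransmissions keep flowing until everybody had a chance to receive them.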

Requirement 2: Store all received messages

Since the state machine does not require strict synchronization between participants, there is no guarantee about which point of the execution the rest of the group has reached. If the current member is at the first state of the execution and all other members have advanced to further states, the current member should accept and store messages from further states "for the future" instead of rejecting them. This can be achieved using the faststate.BaseState structure.
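As an illustration of this requirement, here is a hypothetical message history buffer that accepts every message, including messages belonging to states the local member has not reached yet; the types and names are made up for the example and do not reflect the actual faststate.BaseState implementation:

```go
package messagehistory

import "sync"

// protocolMessage is a hypothetical stand-in for net.TaggedMarshaller; only
// the message type tag matters for this sketch.
type protocolMessage interface {
	Type() string
}

// history stores every received message, including messages that belong to
// states the local member has not reached yet.
type history struct {
	mutex    sync.Mutex
	messages []protocolMessage
}

// receive appends the message unconditionally; messages "from the future" are
// kept instead of being rejected.
func (h *history) receive(msg protocolMessage) {
	h.mutex.Lock()
	defer h.mutex.Unlock()
	h.messages = append(h.messages, msg)
}

// getAll returns messages of the given type so a later state can pick up what
// arrived while the member was still in an earlier state; it may return nil
// if no such message was received yet.
func (h *history) getAll(messageType string) []protocolMessage {
	h.mutex.Lock()
	defer h.mutex.Unlock()

	var result []protocolMessage
	for _, msg := range h.messages {
		if msg.Type() == messageType {
			result = append(result, msg)
		}
	}
	return result
}
```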

Despite having the Type() available in every net.TaggedMarshaller
message sent, we were hardcoding it to "local". This would not play well
with tests that send multiple types of messages via the local broadcast
channel. One such test will be added in one of the next commits.
Explained how the synchronization works and why we need
DelayBlocks().
@pdyraga pdyraga requested a review from a team October 17, 2022 11:22
@pdyraga pdyraga added this to the v2.0.0-m3 milestone Oct 17, 2022
@pdyraga pdyraga self-assigned this Oct 17, 2022
@@ -0,0 +1,129 @@
// Package faststate contains a generic state machine implementation that is
Member

I'm wondering about alternative ways of structuring the code of both state machines. Packages state and faststate can be really mysterious for someone not familiar with this code and the differences are not obvious at first sight.

What comes to my mind is the fact that we can call the state.Machine a "synchronous" machine, as states are synchronized using the block counter. On the other hand, the new faststate.Machine is "asynchronous", as states decide about the transition using their own internal predicates. That said, we can refactor the current structure in one of two ways:

Option 1: We can rename the package state to syncstate and faststate to asyncstate. This way we will end up with the following types:

  • syncstate.State
  • syncstate.Machine
  • asyncstate.State
  • asyncstate.Machine

Option 2: We can move everything into a single package state and rename specific types. This way we would have:

  • state.SyncState
  • state.SyncMachine
  • state.AsyncState
  • state.AsyncMachine

Although O1 can be tempting as we would have two separate and small packages with well-defined types, I think we should lean towards O2. The main reasons are:

  • All this code is about state and both machines (and associated types) have a lot in common, so it makes sense to keep them in a single state package. We can probably extract some common parts to reduce duplication (both State interfaces share a lot of methods).
  • This option is closer to how we typically group things into packages. We deal with some different flavors of the same component. For example, we have the pkg/tecdsa/retry package that exposes two retry algorithms, for two separate use cases.
  • The size of the resulting package will still be at a sane level.

Member Author

I also lean towards Option 2 but I would do it in a separate PR. I am afraid the diff will be significant and we could run into conflicts with other work.

Member

Sure, let's do it separately!

Member Author

Done in #3365.

Simplified `receive` function to just use append since it works with a
nil slice. Also, improved GetAllReceivedMessages documentation to make
it clear that a nil slice may be returned.
The continue was the last instruction in the case block so it was
redundant.
The name cancelCtx better explains what is canceled.
This function needs to be exported so that it can be used outside of the
faststate package. At the same time, we do not want to name it Receive,
so that state implementations are forced to implement Receive themselves
and wrap ReceiveToHistory with a set of validations for net.Message.
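A rough sketch of the pattern described in the last commit note above, with hypothetical types standing in for net.Message; the validation performed by Receive is made up for the example:

```go
package receivewrapper

import "errors"

// netMessage is a hypothetical stand-in for net.Message; only the parts used
// by the sketch are modeled.
type netMessage interface {
	SenderID() int
}

var errUnexpectedSender = errors.New("unexpected sender")

// BaseState sketch: ReceiveToHistory is exported so states built on top of it
// can call it from outside the package, but it is deliberately not named
// Receive, so every concrete state still has to provide its own Receive.
type BaseState struct {
	phaseMessages []netMessage
}

// ReceiveToHistory stores the message unconditionally. append works on a nil
// slice, so phaseMessages needs no explicit initialization.
func (bs *BaseState) ReceiveToHistory(msg netMessage) {
	bs.phaseMessages = append(bs.phaseMessages, msg)
}

// exampleState shows the intended pattern: the state's own Receive validates
// the net.Message first and only then delegates to ReceiveToHistory.
type exampleState struct {
	BaseState
	expectedSender int
}

func (s *exampleState) Receive(msg netMessage) error {
	if msg.SenderID() != s.expectedSender {
		return errUnexpectedSender
	}
	s.ReceiveToHistory(msg)
	return nil
}
```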
@lukasz-zimnoch lukasz-zimnoch merged commit 400168c into main Oct 19, 2022
@lukasz-zimnoch lukasz-zimnoch deleted the faststate branch October 19, 2022 14:25
@pdyraga pdyraga modified the milestones: v2.0.0-m3, v2.0.0-m2 Oct 24, 2022
lukasz-zimnoch added a commit that referenced this pull request Oct 25, 2022
Refs #3366
See this discussion:
#3362 (comment)

The goal of the changes is to merge together `faststate` and `state`
packages. Instead of two packages, we are going to have one `state`
package with two implementations of the state machine: asynchronous and
synchronous.
    
`state.SyncMachine` is meant to be used with interactive protocols when
participants are expected to synchronize based on the number of blocks
being mined while the protocol is executing. Even if the given
participant received all the necessary information to continue the
protocol, the state machine waits for a fixed number of blocks before
proceeding to the next step. This approach works best when the
protocol may finish successfully even if some members expected to
participate in the execution are inactive.

`state.AsyncMachine` is meant to be used with interactive protocols when
participants are expected to synchronize based on the messages being
sent and some participants may be slower than others. To finish the
execution, each protocol participant must wait for the slowest
participant. This approach works best when the protocol may
finish successfully only if all members expected to participate in the
execution are actively participating.
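To illustrate the difference in the transition rule only, a hedged sketch: the synchronous state exposes a block-based delay, while the asynchronous state replaces it with a readiness predicate like the one sketched in the PR description above. The method names besides DelayBlocks are illustrative and the real interfaces in the state package are richer than this:

```go
package statesketch

import "context"

// syncState sketch: the machine waits DelayBlocks() mined blocks in this
// state before transitioning, even if all messages already arrived.
type syncState interface {
	Execute(ctx context.Context) error
	DelayBlocks() uint64
}

// asyncState sketch: the machine transitions as soon as the state reports
// readiness, so the pace is set by the slowest participant rather than by a
// fixed block schedule.
type asyncState interface {
	Execute(ctx context.Context) error
	CanTransition() bool
}
```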
pdyraga added a commit that referenced this pull request Oct 26, 2022
Refs: #3366

Here we integrate the tECDSA signing protocol with the asynchronous
state machine introduced by
#3362. Changes presented
in this pull request involve changes to the `pkg/tecdsa/signing` state
implementations in order to conform to the asynchronous state interface.
As a result, the async state machine becomes the default runner used
within the `signing.Execute` function. Thanks to the changes presented in
this PR, the time of a single signing attempt can be greatly reduced as
the protocol duration is no longer fixed and can adjust to the slowest
participant.
pdyraga added a commit that referenced this pull request Nov 10, 2022
Here we integrate the tECDSA DKG protocol with the async state machine
introduced by #3362. The presented
changes are analogous to the recent changes introduced to the tECDSA signing
protocol by #3379. Long story short, we no longer use a fixed block duration for
the protocol's states but move to a global timeout for the entire protocol attempt
and switch states as soon as possible. As a result, the time of a single DKG
attempt can be reduced as the protocol duration is no longer fixed and can adjust
to the slowest participant.

The aforementioned change in the DKG protocol also involves a similar change in
the DKG result publication step. The publication step is now based on the
asynchronous state machine as well. This has two major consequences:
- Firstly, so far, the participants were waiting for at least a group quorum of
DKG result signatures for a fixed block duration and then moved to the result
submission state. Now, all members wait for the actual group size of DKG result
signatures and then move to the result submission. This way, we are introducing an
additional health check of group participants at the result publication step and
thus maximize the final number of operating group members. This is based on
the assumption that if DKG went to the publication step, all participants had to
complete the computation step successfully, so they should still be operating
for publication.
- Secondly, the result submission queue is now time-based. Since the publication
step relies on the async state machine, we can no longer use the block counter
to set the submission eligibility queue. Each member observes a time delay,
proportional to its member index, before attempting to submit the DKG result
(see the sketch after this list).
Last but not least, we were forced to improve the message retransmission
mechanism. As the DKG's TSS round two is computationally expensive, it takes
members ~100 blocks to complete their computations. The previous retransmission
strategy retransmitted every message on every new block. In TSS round two, each
member already retransmits messages from the ephemeral key generation state and
from TSS round one. If the TSS round two state takes ~100 blocks, the number of
messages flying around the network is huge and often caused network
congestion. Also, the TSS round two state always takes almost all available CPU,
so the situation became even worse. This resulted in a poor DKG success rate.
To improve the situation and lower the number of messages passed through the
network, we introduced a retransmission backoff strategy (#3396). All messages
are now retransmitted at a decreasing rate so older messages are retransmitted
less and less frequently. This allowed us to achieve a good success rate of the
DKG protocol.
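A simplified sketch of a decreasing-rate retransmission rule in the spirit described above; the real strategy from #3396 may differ in its details:

```go
package retransmission

// shouldRetransmit decides, for a given tick (e.g. a new block) since the
// message was first sent, whether the message should be retransmitted again.
// The rate decreases: only ticks that are powers of two trigger a
// retransmission, so older messages are retransmitted less and less often.
func shouldRetransmit(ticksSinceFirstSend int) bool {
	if ticksSinceFirstSend <= 0 {
		return false
	}
	// n & (n-1) == 0 holds exactly for powers of two.
	return ticksSinceFirstSend&(ticksSinceFirstSend-1) == 0
}
```

Under this rule a message is retransmitted on ticks 1, 2, 4, 8, 16, ..., so a message that is 100 blocks old has been retransmitted only a handful of times instead of 100 times.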

Changes presented in this PR were tested in a local environment with three
clients. We performed five DKGs and all completed successfully after
~18 minutes. That is an improvement, as DKG based on the sync
machine always took ~26 minutes (a fixed duration of 125 computation blocks and 5
publication blocks with an average block time of 12 seconds).
@pdyraga pdyraga removed their assignment Nov 10, 2022