tECDSA signing: Integration with the asynchronous state machine#3379
Here we integrate the fast state machine with the tECDSA signing protocol. All states defined in `pkg/tecdsa/signing` now conform to the fast state interface. Also, the fast state machine is now used as the state executor.
Retransmissions: We now use a global context in the state machine which is active until the timeout block. This context is used in all `channel.Send` calls that manage retransmissions under the hood. The context is used as a key in the map that holds the functions responsible for specific retransmissions. If we call `channel.Send` with the same context for different messages, the latest call will overwrite previous retransmissions with the current one. To avoid that, we use a separate child context for each message.
```go
	return ephemeralKeyPairStateDelayBlocks
}

func (ekpgs *ephemeralKeyPairGenerationState) Initiate(ctx context.Context) error {
	ekpgs.action.run(func() error {
```
Why do we need to wrap the call to generateEphemeralKeyPair() with ekpgs.action.run? The state machine should guarantee Initiate is called only one time.
This is not about the number of calls to Initiate. The reasoning is as follows:
- we need to wait until all ephemeral keys are generated and the state message is sent before we proceed to the next state
- currently, the state machine implementation guarantees that `Initiate` is called before `CanTransition` but this is an implementation detail that can change in the future so we cannot consider it as a strong guarantee. Because of that, we should check whether the state completed its computations and sent its message before proceeding in the state implementation as well. We don't want to check state internals (i.e. the number of generated keys) as this is not the right abstraction level for that. Additionally, that check does not guarantee that `channel.Send` was called
- `Initiate` should return as fast as possible as the message buffer can become filled meanwhile. Running an async action matches that requirement (a rough sketch of this pattern follows below)
- Almost all states of signing perform async computations so using this pattern for `ephemeralKeyPairGenerationState` is justified as we have consistency across all states
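To illustrate, here is a rough, hypothetical sketch of that pattern. It is loosely modelled on the `stateAction`/`run` names visible in the diff, but the implementation details are assumptions, not the actual keep-core code:

```go
// Hypothetical sketch: Initiate triggers the expensive work asynchronously via
// an action that runs at most once; the state later checks whether the action
// completed instead of inspecting its own internals.
package main

import (
	"fmt"
	"sync"
	"time"
)

// stateAction runs a function exactly once, in the background, and remembers
// whether it finished and with what error.
type stateAction struct {
	once sync.Once
	done chan struct{}
	err  error
}

func newStateAction() *stateAction {
	return &stateAction{done: make(chan struct{})}
}

func (sa *stateAction) run(fn func() error) {
	sa.once.Do(func() {
		go func() {
			sa.err = fn()
			close(sa.done)
		}()
	})
}

// completed reports whether the action finished without an error.
func (sa *stateAction) completed() bool {
	select {
	case <-sa.done:
		return sa.err == nil
	default:
		return false
	}
}

func main() {
	action := newStateAction()

	// Initiate would call run with key generation plus channel.Send inside;
	// it returns immediately so the message buffer can keep draining.
	action.run(func() error {
		time.Sleep(100 * time.Millisecond) // stands in for the real work
		return nil
	})

	fmt.Println(action.completed()) // false, work still in progress
	time.Sleep(200 * time.Millisecond)
	fmt.Println(action.completed()) // true, CanTransition could now succeed
}
```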
It seems to me that this is a problem with a state machine and not something the state implementation should worry about.
What do you think about:
- Making `AsyncState.Initiate` asynchronous in the state machine.
- Making `AsyncState.Initiate` return a channel when the computations are done.
- Probably getting rid of the `stateAction` structure.
I can try experimenting with this approach; if it works fine, we can open another PR that will be merged immediately after this one.
I think that will make the state machine much more complicated as we need to examine the state of computations triggered by `Initiate` in `CanTransition`. There is also a risk that we'll specialize the machine too much and it will not fit other possible use cases well enough. Of course, we can always drive some experiments and see.
Agreed, this will move more logic to the state machine but right now this logic is outside of the state machine and complicates state implementation. We do not necessarily need to modify the machine, we could also move some of this logic to the base async state. Anyway, something to experiment with.
Some loose thoughts here:
- `stateAction` ensures the given action can be triggered only one time. This is something the `Initiate` interface should be clear about and ensure. The state implementation should just trust this assumption.
- In the async state machine we gather all messages received, also those from the future. Blocking `Initiate` makes sense for the sync state machine, where we may want to initiate a state before we are ready to accept messages. For the async state machine, we want to accept messages all the time and blocking `Initiate` is not really needed.
- On the abstract level, the state machine should not be asking the state if it `CanTransition` if `Initiate` of the state has not been completed yet, no matter if `Initiate` is blocking or not. This is happening when `Initiate` is blocking, but requires some more logic from the state if `Initiate` logic is async.
```go
	// cannot be used directly because it is global for the entire
	// state machine. Because of that, channel.Send must be called with
	// a child context.
	sendCtx := context.WithValue(ctx, "message", message.Type())
```
Why do we use `WithValue`? Is it to trick `ticker.go` into distinguishing contexts and not treating contexts for two different states as equal?
We don't have a simple way to create a context copy in Go apart from wrapping it with `context.WithValue`. All other functions like `WithCancel` or `WithDeadline` produce a cancel function that must be managed somehow. We can't just ignore it as, in theory, we don't have a guarantee that the parent context is always canceled (that would cause context leaks). On the other hand, we cannot set up a proper call to the cancel function at this level. This is why the `WithValue` trick is a good option here, especially if we treat this solution as a temporary one until we fix the problem directly in `ticker.go`.
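To make the trade-off concrete, here is a minimal, self-contained sketch (names are assumptions, not the actual `ticker.go` code) showing why a context-keyed registry needs a distinct child context per message and why `context.WithValue` provides one without a cancel function to manage:

```go
// Hypothetical sketch: a registry keyed by context overwrites entries when the
// same context is reused, while wrapping the parent with context.WithValue
// yields a distinct child per message and leaves no cancel function behind.
package main

import (
	"context"
	"fmt"
)

// messageTypeKey is an illustrative context key type; the snippet under review
// uses a plain string key as a temporary measure.
type messageTypeKey struct{}

func main() {
	parent := context.Background()

	// registry stands in for the map in ticker.go that keeps retransmission
	// functions keyed by context.
	registry := make(map[context.Context]string)

	// Reusing the parent context directly: the second registration overwrites
	// the first one because the map key is identical.
	registry[parent] = "ephemeralPublicKeyMessage"
	registry[parent] = "tssRoundOneMessage"
	fmt.Println(len(registry)) // 1

	// Wrapping with context.WithValue produces a distinct child context per
	// message, so nothing is overwritten and nothing has to be canceled here.
	ctx1 := context.WithValue(parent, messageTypeKey{}, "ephemeralPublicKeyMessage")
	ctx2 := context.WithValue(parent, messageTypeKey{}, "tssRoundOneMessage")
	registry[ctx1] = "ephemeralPublicKeyMessage"
	registry[ctx2] = "tssRoundOneMessage"
	fmt.Println(len(registry)) // 3
}
```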
```
# Conflicts:
#	pkg/tecdsa/signing/signing.go
#	pkg/tecdsa/signing/states.go
```
```go
// signingAttemptMaxBlockDuration determines the maximum block duration of a
// single signing attempt.
const signingAttemptMaxBlockDuration = 100
```
Just flagging that this value requires more thinking. 100 blocks assuming 12sec for a single block is 20 minutes.
With 2 malicious members in a signing group, we need 5 retries of the protocol in the worst case (P = (98 choose 51) / (100 choose 51) = 0.23757575757), which takes 1h 40min.
With 3 malicious members in a signing group, we need 10 retries of the protocol in the worst case (P = (97 choose 51) / (100 choose 51) = 0.11393939393), which takes 3h 20 min.
Say we want to sweep no more than 10 deposits at once; this will take almost 17 hours in the 2-malicious-members case and twice that in the 3-malicious-members case. Not blocking this PR, rather something for threshold-network/tbtc-v2#414
...plus the blocks for announcement and delay!
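For reference, the probabilities quoted above can be reproduced with a small snippet; the helper name is made up for illustration:

```go
// Hypothetical helper: the chance that a randomly selected signing subset of
// t = 51 out of n = 100 members contains none of the m malicious members is
// P = C(n-m, t) / C(n, t).
package main

import (
	"fmt"
	"math/big"
)

func honestSubsetProbability(n, t, m int64) float64 {
	numerator := new(big.Int).Binomial(n-m, t)
	denominator := new(big.Int).Binomial(n, t)
	p, _ := new(big.Float).Quo(
		new(big.Float).SetInt(numerator),
		new(big.Float).SetInt(denominator),
	).Float64()
	return p
}

func main() {
	fmt.Println(honestSubsetProbability(100, 51, 2)) // ~0.2376
	fmt.Println(honestSubsetProbability(100, 51, 3)) // ~0.1139
}
```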
The state machine execution loop should exit once the machine context is done. Otherwise, the loop will run forever, causing a resource leak.
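A minimal, hypothetical sketch of the point above (illustrative names, not the actual keep-core state machine): the execution loop selects on `ctx.Done()` so it terminates once the machine context is done instead of looping forever:

```go
// Hypothetical sketch of an execution loop that cannot leak a goroutine when
// the protocol attempt times out.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type asyncState interface {
	Initiate(ctx context.Context) error
	CanTransition() bool
	Next() (asyncState, error)
}

// execute drives states until a final state is reached or ctx is done.
func execute(ctx context.Context, initial asyncState) (asyncState, error) {
	current := initial
	for {
		if err := current.Initiate(ctx); err != nil {
			return nil, err
		}
		for !current.CanTransition() {
			select {
			case <-ctx.Done():
				// The machine context is done; exit instead of spinning forever.
				return nil, ctx.Err()
			case <-time.After(50 * time.Millisecond):
				// Re-check; the real machine reacts to received messages.
			}
		}
		next, err := current.Next()
		if err != nil {
			return nil, err
		}
		if next == nil {
			return current, nil // final state reached
		}
		current = next
	}
}

// stuckState never becomes ready to transition; only the context stops it.
type stuckState struct{}

func (s *stuckState) Initiate(ctx context.Context) error { return nil }
func (s *stuckState) CanTransition() bool                { return false }
func (s *stuckState) Next() (asyncState, error)          { return nil, errors.New("unreachable") }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()

	_, err := execute(ctx, &stuckState{})
	fmt.Println(err) // context deadline exceeded
}
```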
The code here so far looks good but I haven't tested it yet. Waiting for two more changes planned by @lukasz-zimnoch:
The comments left are for future PRs; I would like to do some experimentation with the state machine based on the comments.
The stop pill should be broadcasted after a delay to give other participants some time to complete the result computation.
```go
	}

	return result
}
```
Both receivedMessages and receivedMessages could be abstracted somewhere. I am not sure if the state package is the best place for it but we will have to repeat the same code over and over again. Not blocking this PR though, just flagging.
Yeah, I was thinking about it as well and had the same feelings.
```diff
-func (ekpgs *ephemeralKeyPairGenerationState) Next() (state.SyncState, error) {
+func (ekpgs *ephemeralKeyPairGenerationState) CanTransition() bool {
+	messagingDone := len(receivedMessages[*ephemeralPublicKeyMessage](ekpgs)) ==
+		len(ekpgs.member.group.OperatingMemberIDs())-1
```
Would be great to abstract it out and calculate once when instantiating the initial state. Non-blocking, something for the future.
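As an illustration of what such an abstraction could look like, here is a self-contained, hypothetical sketch; the types and helper names are assumptions, not the actual keep-core API:

```go
// Hypothetical sketch: a generic helper that picks messages of a concrete type
// out of a state's inbox, plus a "did everyone deliver" check computed from
// the group size.
package main

import "fmt"

type message interface{ Type() string }

type ephemeralPublicKeyMessage struct{ senderID int }

func (m *ephemeralPublicKeyMessage) Type() string { return "ephemeralPublicKeyMessage" }

type memberState struct {
	inbox              []message
	operatingMemberIDs []int
}

// receivedMessages returns all messages of type T received so far.
func receivedMessages[T message](s *memberState) []T {
	var result []T
	for _, msg := range s.inbox {
		if typed, ok := msg.(T); ok {
			result = append(result, typed)
		}
	}
	return result
}

// messagingDone reports whether every other operating member delivered a
// message of type T.
func messagingDone[T message](s *memberState) bool {
	return len(receivedMessages[T](s)) == len(s.operatingMemberIDs)-1
}

func main() {
	s := &memberState{
		inbox: []message{
			&ephemeralPublicKeyMessage{senderID: 2},
			&ephemeralPublicKeyMessage{senderID: 3},
		},
		operatingMemberIDs: []int{1, 2, 3},
	}
	fmt.Println(messagingDone[*ephemeralPublicKeyMessage](s)) // true
}
```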
```go
	// be more sophisticated but since it is temporary, we
	// can live with it for now.
	go func() {
		time.Sleep(1 * time.Minute)
```
This time has to be shorter than the delay between attempts, right?
This is not a hard requirement. It will work even if this time is longer.
But then operators who were excluded from the current attempt may kick off the execution of the next attempt, correct?
Yeah, but once the stop pill arrives they will break. Even if the pill does not arrive, they will just fail eventually. Anyway, the stop pill mechanism will need to be reworked or replaced eventually once we have integration with the BTC network.
Tested the following scenarios using a local environment with 3 clients:
- Scenario 1: All clients running, signature requested, and produced without errors after 12 blocks.
- Scenario 2: All clients running, signature requested but one client goes down during execution. Other clients wait until the 100-block timeout gets hit and start the second attempt, which completes with success after 15 blocks.
- Scenario 3: All clients running, signature requested but a resource (CPU) shortage occurs during the processing (DKG round two happens at the same time). Execution slows down but finally the signature is produced after 20 blocks.
- Scenario 4: All clients running, signature requested but some members start with a delay (temporary hack in the code). Signature produced after 18 blocks.
Scenarios tested:
- Scenario 1: All clients running, signature requested, and produced successfully.
- Scenario 2: All clients running, signature requested but one client goes down during execution. Other clients wait until the 100-block timeout gets hit and start the second attempt, which completes successfully.

There are some suggestions left in this PR but I'll try to address them separately. This PR is a huge improvement and the suggestions require some experimentation.
Refs #3366
See #3379 (comment)
See 16a91a1

Previously we were identifying retransmission handlers by context. This did work well when each state for each member had a separate context in the synchronous state machine. It did not work well though when the same context was shared for all states in the async state machine. Given that the retransmission ticker identified retransmission functions by context, new retransmission requests were overwriting previous ones using the same context. We fixed it temporarily by creating new instances of child contexts but that was not a final solution we wanted to go with.

This commit fixes the problem in a proper way (TM). Each retransmission has a unique ID and no matter if the same or separate contexts are used, registering new retransmission requests does not overwrite existing ones.

Tested this PR by running a single signing that succeeded in ~3 minutes (with announcement phase included).
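For illustration, a hypothetical sketch of the ID-based approach described above; the names are made up and this is not the actual `ticker.go` implementation:

```go
// Hypothetical sketch: retransmission handlers keyed by a unique, increasing
// ID instead of by context, so registering a new handler never overwrites an
// existing one even when all states share the same context.
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

type retransmissionTicker struct {
	mu       sync.Mutex
	nextID   uint64
	handlers map[uint64]func()
}

func newRetransmissionTicker() *retransmissionTicker {
	return &retransmissionTicker{handlers: make(map[uint64]func())}
}

// register adds a retransmission handler and returns its unique ID. The
// context is kept only to stop retransmitting once it is done.
func (t *retransmissionTicker) register(ctx context.Context, handler func()) uint64 {
	id := atomic.AddUint64(&t.nextID, 1)
	t.mu.Lock()
	t.handlers[id] = handler
	t.mu.Unlock()
	go func() {
		<-ctx.Done()
		t.mu.Lock()
		delete(t.handlers, id)
		t.mu.Unlock()
	}()
	return id
}

// tick invokes all registered handlers; a real implementation would be driven
// by new blocks and could apply a backoff strategy.
func (t *retransmissionTicker) tick() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, handler := range t.handlers {
		handler()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ticker := newRetransmissionTicker()
	// Both registrations share the same context, yet neither overwrites the
	// other because each gets its own ID.
	ticker.register(ctx, func() { fmt.Println("retransmit: ephemeralPublicKeyMessage") })
	ticker.register(ctx, func() { fmt.Println("retransmit: tssRoundOneMessage") })
	ticker.tick()
}
```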
Refs #3366
See #3379 (comment)

In the async state machine we gather all messages received, also those from the "future" states. Blocking `Initiate` makes sense for the sync state machine, where we may want to initiate a state before we are ready to accept messages. For the async state machine, we want to accept messages all the time and blocking `Initiate` is not needed. Having `Initiate` run asynchronously will allow us to `Recv` messages to states and not overflow the buffer in the state machine in case the computations take longer than expected. It should also allow for faster execution of states because messages are received also at the time of initiating the state, so we can potentially transition right after the computations are done.

The state machine should not be asking the state if it `CanTransition` if `Initiate` of the state has not been completed yet. This was guaranteed with a blocking `Initiate` and has been maintained for the non-blocking `Initiate`. I've made it clear in the documentation of the `AsyncState`. This change has been done in 428e013. The next commit, 4a8a8c5, is a proof of how we could simplify state code with that change. I recommend reviewing the two commits separately.
Here we integrate the tECDSA DKG protocol with the async state machine introduced by #3362. The presented changes are analogous to the recent changes introduced to the tECDSA signing protocol by #3379. Long story short, we no longer use a fixed block duration for the protocol's states but move to a global timeout for the entire protocol attempt and switch states as soon as possible. As a result, the time of a single DKG attempt can be reduced as the protocol is no longer fixed and can adjust to the slowest participant.

The aforementioned change in the DKG protocol also involves a similar change in the DKG result publication step. The publication step is now based on the asynchronous state machine as well. This has two major consequences:
- Firstly, so far, the participants were waiting for at least a group quorum of DKG result signatures for a fixed block duration and then moved to the result submission state. Now, all members wait for the actual group size of DKG result signatures and then move to the result submission. This way, we are introducing an additional health check of group participants at the result publication step, thus we maximize the final number of operating group members. This is based on the assumption that if DKG went to the publication step, all participants had to complete the computation step successfully, so they should still be operating for publication.
- Secondly, the result submission queue is now time-based. Since the publication step relies on the async state machine, we can no longer use the block counter to set the submission eligibility queue. Each member preserves a time delay, according to its member index, before attempting to submit the DKG result.

Last but not least, we were forced to improve the message retransmission mechanism. As the DKG's TSS round two is computationally expensive, it takes ~100 blocks to complete its computations. The current retransmission strategy retransmits every message on every new block. In TSS round two, each member already retransmits messages from the ephemeral key generation state and from TSS round one. If the TSS round two state takes ~100 blocks, the amount of messages flying around the network is really huge and often caused network congestion. Also, the TSS round two state always takes almost all available CPU, so the situation became even worse. This resulted in a poor DKG success rate. To improve the situation and lower the number of messages passed through the network, we introduced a retransmission backoff strategy (#3396). All messages are now retransmitted at a decreasing rate so older messages are retransmitted less and less frequently. This allowed us to achieve a good success rate of the DKG protocol.

Changes presented in this PR were tested on a local environment with three clients. We performed five DKGs and all were completed with success after ~18 minutes. That means we achieved an improvement as the DKG based on the sync machine always took ~26 minutes (fixed duration of 125 computation blocks and 5 publication blocks with an average block time of 12 seconds).
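As an illustration only, here is one possible shape of such a backoff schedule; the actual strategy was introduced in #3396 and may differ from this sketch:

```go
// Purely illustrative backoff: a message is retransmitted on every block at
// first, then at block intervals that double after each cycle, so older
// messages are retransmitted less and less frequently.
package main

import "fmt"

// shouldRetransmit reports whether a message first sent at firstBlock should
// be retransmitted at currentBlock. The interval between retransmissions
// doubles after each "cycle" of the given length.
func shouldRetransmit(firstBlock, currentBlock, cycleBlocks uint64) bool {
	age := currentBlock - firstBlock
	if age == 0 {
		return false // just sent, nothing to retransmit yet
	}
	interval := uint64(1)
	for threshold := cycleBlocks; age > threshold; threshold += cycleBlocks {
		interval *= 2
	}
	return age%interval == 0
}

func main() {
	for block := uint64(1); block <= 30; block++ {
		if shouldRetransmit(0, block, 10) {
			fmt.Printf("block %d: retransmit\n", block)
		}
	}
}
```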
Refs: #3366
Here we integrate the tECDSA signing protocol with the asynchronous state machine introduced by #3362. Changes presented in this pull request involve changes in the `pkg/tecdsa/signing` states implementations in order to conform to the asynchronous state interface. As a result, the async state machine becomes the default runner used within the `signing.Execute` function. Thanks to the changes presented in this PR, the time of a single signing attempt can be greatly reduced as the protocol is no longer fixed and can adjust to the slowest participant.