
tECDSA signing: Integration with the asynchronous state machine#3379

Merged
pdyraga merged 10 commits into main from ecdsa-signing-faststate on Oct 26, 2022
Conversation

@lukasz-zimnoch (Member) commented Oct 25, 2022

Refs: #3366

Here we integrate the tECDSA signing protocol with the asynchronous state machine introduced by #3362. This pull request changes the state implementations in pkg/tecdsa/signing so that they conform to the asynchronous state interface. As a result, the async state machine becomes the default runner used within the signing.Execute function. Thanks to these changes, the time of a single signing attempt can be greatly reduced: the protocol's pace is no longer fixed and can adjust to the slowest participant.

Here we integrate the fast state machine with the tECDSA signing
protocol. All states defined in `pkg/tecdsa/signing` now
conform to the fast state interface. Also, the fast state machine
is now used as the state executor.
…ssions

We now use a global context in the state machine which is active until the
timeout block. This context is used in all `channel.Send` calls that manage
retransmissions under the hood. The context is used as key in the map that holds
the functions responsible for specific retransmissions. If we call
`channel.Send` with the same context for different messages, the latest call
will overwrite previous retransmissions with the current one. To avoid that,
we use a separate child context for each message.
@lukasz-zimnoch lukasz-zimnoch self-assigned this Oct 25, 2022
@lukasz-zimnoch lukasz-zimnoch added this to the v2.0.0-m2 milestone Oct 25, 2022
```go
	return ephemeralKeyPairStateDelayBlocks
}

func (ekpgs *ephemeralKeyPairGenerationState) Initiate(ctx context.Context) error {
	ekpgs.action.run(func() error {
```
Member:
Why do we need to wrap the call to generateEphemeralKeyPair() with ekpgs.action.run? The state machine should guarantee Initiate is called only once.

Member Author:
This is not about the number of calls to Initiate. The reasoning is as follows:

  • we need to wait until all ephemeral keys are generated and the state message is sent before we proceed to the next state
  • currently, the state machine implementation guarantees that Initiate is called before CanTransition, but this is an implementation detail that can change in the future, so we cannot treat it as a strong guarantee. Because of that, the state implementation itself should check whether it completed its computations and sent its message before proceeding. We don't want to check state internals (i.e. the number of generated keys) as this is not the right abstraction level. Additionally, such a check would not guarantee that channel.Send was called
  • Initiate should return as fast as possible, as the message buffer can fill up in the meantime. Running an async action matches that requirement
  • Almost all signing states perform async computations, so using this pattern for ephemeralKeyPairGenerationState gives us consistency across all states

Member:
It seems to me that this is a problem with a state machine and not something the state implementation should worry about.

What do you think about:

  • Making AsyncState.Initiate asynchronous in the state machine.
  • Making AsyncState.Initiate return a channel when the computations are done.
  • Probably getting rid of stateAction structure.

I can try experimenting with this approach; if it works fine, we can open another PR that will be merged immediately after this one.

@lukasz-zimnoch (Member Author) commented Oct 25, 2022:
I think that will make the state machine much more complicated as we need to examine the state of computations triggered by Initiate in CanTransition. There is also a risk that we'll specialize the machine too much and it will not fit other possible use cases well enough. Of course, we can always run some experiments and see.

Member:
Agreed, this will move more logic to the state machine but right now this logic is outside of the state machine and complicates state implementation. We do not necessarily need to modify the machine, we could also move some of this logic to the base async state. Anyway, something to experiment with.

Some loose thoughts here:

  1. stateAction ensures the given action can be triggered only once. This is something the Initiate interface should be clear about and guarantee. The state implementation should just trust this assumption.

  2. In the async state machine we gather all messages received, also those from the future. Blocking Initiate makes sense for the sync state machine, where we may want to initiate a state before we are ready to accept messages. For the async state machine, we want to accept messages all the time and blocking Initiate is not really needed.

  3. On the abstract level, the state machine should not be asking the state if it CanTransition if Initiate of the state has not been completed yet, no matter if Initiate is blocking or not. This is happening when Initiate is blocking, but requires some more logic from the state if Initiate logic is async.

Member:
Addressed in #3385.

@lukasz-zimnoch lukasz-zimnoch marked this pull request as ready for review October 25, 2022 10:21
```go
// cannot be used directly because it is global for the entire
// state machine. Because of that, channel.Send must be called with
// a child context.
sendCtx := context.WithValue(ctx, "message", message.Type())
```
Member:
Why do we use WithValue? Is it to trick the ticker.go to distinguish contexts and not treat contexts for two different states as equal?

Member Author:
We don't have a simple way to create a context copy in Go apart from wrapping it with context.WithValue. All other functions like WithCancel or WithDeadline produce a cancel function that must be managed somehow. We can't just ignore it as, in theory, we have no guarantee that the parent context is always canceled (that would cause context leaks). On the other hand, we cannot set up a proper call to the cancel function at this level. This is why the WithValue trick is a good option here, especially if we treat this solution as a temporary one until we fix the problem directly in ticker.go.

Member:
Addressed in #3383.

Comment thread pkg/tbtc/signing.go

```go
// signingAttemptMaxBlockDuration determines the maximum block duration of a
// single signing attempt.
const signingAttemptMaxBlockDuration = 100
```
Member:
Just flagging that this value requires more thinking. 100 blocks assuming 12sec for a single block is 20 minutes.

With 2 malicious members in a signing group, we need 5 retries of the protocol in the worst case (P = (98 choose 51) / (100 choose 51) = 0.23757575757), which takes 1h 40min.

With 3 malicious members in a signing group, we need 10 retries of the protocol in the worst case (P = (97 choose 51) / (100 choose 51) = 0.11393939393), which takes 3h 20 min.

Say we want to sweep no more than 10 deposits at once, this will take almost 17 hours for 2 malicious members case and twice that much for 3 malicious members case. Not blocking this PR, rather something for threshold-network/tbtc-v2#414

Member:
...plus the blocks for announcement and delay!

The state machine execution loop should exit once the machine context is
done. Otherwise, the loop will work forever causing a resource leak.
@pdyraga (Member) commented Oct 25, 2022:

The code here so far looks good but I haven't tested it yet. Waiting for two more changes planned by @lukasz-zimnoch:

  • delay sending the stop pill,
  • exit the state machine's Execute when the context is done.

The comments left are for future PRs, I would like to do some experimentation with the state machine based on the comments.

The stop pill should be broadcast after a delay to give other participants
some time to complete the result computation.
@lukasz-zimnoch (Member Author) replied:

> The code here so far looks good but I haven't tested it yet. Waiting for two more changes planned by @lukasz-zimnoch:
>
>   • delay sending the stop pill,
>   • exit the state machine's Execute when the context is done.
>
> The comments left are for future PRs, I would like to do some experimentation with the state machine based on the comments.

Done in 16007dd and b2b033e

```go
	}

	return result
}
```
Member:

Both receivedMessages and its counterpart could be abstracted somewhere. I am not sure if the state package is the best place for it but we will have to repeat the same code over and over again. Not blocking this PR though, just flagging.

Member Author:

Yeah, I was thinking about it as well and had the same feelings.

Member Author:

Addressed in #3386

```go
func (ekpgs *ephemeralKeyPairGenerationState) Next() (state.SyncState, error) {
func (ekpgs *ephemeralKeyPairGenerationState) CanTransition() bool {
	messagingDone := len(receivedMessages[*ephemeralPublicKeyMessage](ekpgs)) ==
		len(ekpgs.member.group.OperatingMemberIDs())-1
```
Member:

Would be great to abstract it out and calculate once when instantiating the initial state. Non-blocking, something for the future.

Comment thread pkg/tbtc/node.go
```go
// be more sophisticated but since it is temporary, we
// can live with it for now.
go func() {
	time.Sleep(1 * time.Minute)
```
Member:

This time has to be shorter than the delay between attempts, right?

Member Author:

This is not a hard requirement. It will work even if this time is longer.

Member:

But then operators who were excluded from the current attempt may kick off the execution of the next attempt, correct?

@lukasz-zimnoch (Member Author) commented Oct 25, 2022:

Yeah, but once the stop pill arrives they will break. Even if the pill does not arrive, they will just fail eventually. Anyway, the stop pill mechanism will need to be reworked or replaced once we have integration with the BTC network.

@lukasz-zimnoch (Member Author):

Tested the following scenarios using a local environment with 3 clients.

Scenario 1: All clients running, signature requested and produced without errors after 12 blocks.

Scenario 2: All clients running, signature requested but one client goes down during execution. Other clients wait until the 100 blocks timeout gets hit and start the second attempt which completes with success after 15 blocks.

Scenario 3: All clients running, signature requested but resource (CPU) shortage occurs during the processing (DKG round two happens at the same time). Execution slows down but finally, the signature is produced after 20 blocks.

Scenario 4: All clients running, signature requested but some members start with a delay (temporary hack in the code). Signature produced after 18 blocks.

@pdyraga (Member) commented Oct 26, 2022:

Scenarios tested:

Scenario 1: All clients running, signature requested and produced successfully.

Scenario 2: All clients running, signature requested but one client goes down during execution. Other clients wait until the 100 blocks timeout gets hit and start the second attempt which completes successfully.

There are some suggestions left in this PR but I'll try to address them separately. This PR is a huge improvement and the suggestions require some experimentation.

@pdyraga pdyraga merged commit 9b34493 into main Oct 26, 2022
@pdyraga pdyraga deleted the ecdsa-signing-faststate branch October 26, 2022 07:33
lukasz-zimnoch added a commit that referenced this pull request Oct 26, 2022
Refs #3366

See
#3379 (comment)
See
16a91a1

Previously we were identifying retransmission handlers by context. This
worked well when each state of each member had a separate context in
the synchronous state machine. It did not work well, though, when the same
context was shared for all states in the async state machine. Given that
the retransmission ticker identified retransmission functions by
context, new retransmission requests were overwriting previous ones
using the same context. We fixed it temporarily by creating new
instances of child contexts but that was not a final solution we wanted
to go with. This commit fixes the problem in a proper way (TM). Each
retransmission has a unique ID and no matter if the same or separate
contexts are used, registering new retransmission requests does not
overwrite existing ones.

Tested this PR by running a single signing that succeeded in ~3 minutes
(with announcement phase included).
lukasz-zimnoch added a commit that referenced this pull request Oct 28, 2022
Refs #3366
See
#3379 (comment)

In the async state machine we gather all messages received, also those
from the "future" states. Blocking `Initiate` makes sense for the sync
state machine, where we may want to initiate a state before we are ready
to accept messages. For the async state machine, we want to accept
messages all the time and blocking `Initiate` is not needed. Having
`Initiate` run asynchronously will allow us to `Recv` messages to
states and not overflow the buffer in the state machine in case the
computations take longer than expected. It should also allow for a
faster execution of states because the messages are received also at the
time of Initiating the state so we can potentially transition right
after the computations are done.

The state machine should not be asking the state if it `CanTransition`
if `Initiate` of the state has not been completed yet. This was
guaranteed with a blocking `Initiate` and has been maintained for the
non-blocking `Initiate`. I've made it clear in the documentation of the
AsyncState.

This change has been done in 428e013.
The next commit, 4a8a8c5, is a proof
how we could simplify state code with that change. I recommend reviewing
the two commits separately.
pdyraga added a commit that referenced this pull request Nov 10, 2022
Here we integrate the tECDSA DKG protocol with the async state machine
introduced by #3362. The presented
changes are analogous to the recent changes introduced to the tECDSA signing
protocol by #3379. Long story short, we no longer use a fixed block duration for
protocol's states but move to a global timeout for the entire protocol attempt
and switch states as soon as possible. As a result, the time of a single DKG
attempt can be reduced as the protocol is no longer fixed and can adjust to the
slowest participant.

The aforementioned change in the DKG protocol also involves a similar change in
the DKG result publication step. The publication step is now based on the
asynchronous state machine as well. This has two major consequences:
- Firstly, so far, the participants were waiting for at least a group quorum of
DKG result signatures for a fixed block duration and then moved to the result
submission state. Now, all members wait for the actual group size of DKG result
signatures and move to the result submission. This way, we are introducing an
additional health check of group participants at the result publication step
thus we maximize the final number of operating group members. This is based on
the assumption that if DKG went to the publication step, all participants had to
complete the computation step successfully, so they should be still operating
for publication. 
- Secondly, the result submission queue is now time-based. Since the publication
step relies on the async state machine, we can no longer use the block counter
to set the submission eligibility queue. Each member preserves a time delay,
according to its member index, before attempting to submit the DKG result

Last but not least, we were forced to improve the message retransmission
mechanism. The DKG's TSS round two is computationally expensive and takes
~100 blocks to complete. The current retransmission strategy
retransmits every message on every new block. During TSS round two, each
member also retransmits messages from the ephemeral key generation state and
from TSS round one. If the TSS round two state takes ~100 blocks, the number
of messages flying around the network is really huge and often caused network
congestion. Also, the TSS round two state always takes almost all available
CPU, so the situation became even worse. This resulted in a poor DKG success
rate.
To improve the situation and lower the number of messages passed through the
network, we introduced a retransmission backoff strategy (#3396). All messages
are now retransmitted at a decreasing rate so older messages are retransmitted
less and less frequently. This allowed us to achieve a good success rate of the
DKG protocol.

Changes presented in this PR were tested on a local environment with three
clients. Performed five DKGs and all were completed with success after
~18 minutes. That means we achieved an improvement as the DKG based on the sync
machine always took ~26 minutes (fixed duration of 125 computation blocks and 5
publication blocks with an average block time of 12 seconds).