Block miner computations until everyone agrees to resume by pdyraga · Pull Request #3174 · threshold-network/keep-core

pdyraga · 2022-08-17T12:23:53Z

Refs #3161
Depends on #3172

From now on, Miner has an internal latch counting how many times Stop() function has been called. Miner expects Resume() function to be called the same number of times. The reasoning behind it is that multiple goroutines can call Stop() and we do not want the computations to unblock until the last goroutine is done.

For example:

Relay entry signing starts, computations are stopped.
ECDSA DKG starts, computations are already stopped.
Relay entry signing completes. Relay entry goroutine asks the computations to resume.

With the change in this PR we want to resume the computations only when all goroutines that requested to stop them have signalled that they completed their work and computations can be resumed.

The signalled element is crucial. We really need all of them to tell the miner that the computations can be resumed. We do not know when the operations that are supposed to stop the computations are invoked and how long they are going to take. I was playing with an idea of a timeout but that turned out to be complicated and error-prone.

Since we need to trust goroutines to call Resume, it means we assume goroutines are working as expected. I am not totally in love with this assumption but I could not figure out anything else if we do not want the generator to just keep running in the background.

If we assume goroutines are working as expected and always call Resume after Stop, it also means we assume they are not going to call Resume twice because it breaks the assumption of goroutines being correctly implemented.

In the current setup, the lock is acquired only to actually stop or resume goroutines and after the lock is acquired we check the state. Stop and Resume are not atomic but I think everything should work fine under the assumption goroutines are:

Always calling Stop() before Resume()
They never forget to call Resume() if they called Stop().

I am open to all suggestions on how to improve this mechanism. I could come up only with this solution if we want to stop the miner. I was also thinking about abandoning this path completely and reworking the pre-params generator code in tss-lib to always work in a single thread, even if generating pre-params will take ages but I don't think it's a better option. 4 years ago, when we were working on the safe prime generator and were optimizing the code pretty heavily, we decided to go with concurrency (benchmarks). What appeals to me about the current approach is that in the real world, we'll not have thousands of goroutines on a single client. Just a few executing DKG or relay entry generation and the call to the miner is quite simple: miner.Stop(); defer miner.Resume().

From now on, Miner has an internal latch counting how many times `Stop()` function has been called. Miner expects `Resume()` function to be called the same number of times. The reasoning behind it is that multiple goroutines can call `Stop()` and we do not want the computations to unblock until the last goroutine is done. For example: 1. Relay entry signing starts, computations are stopped. 2. ECDSA DKG starts, computations are already stopped. 3. Relay entry signing completes. Relay entry goroutine asks the computations to resume. Computations should not resume until ECDSA DKG goroutine asks to resume them. This approach requires really good hygiene with calls to miner - never forget to resume if you are stopping.

lukasz-zimnoch · 2022-08-18T09:23:30Z

Actually I think we have the following alternative to the presented approach:

The miner has an internal control loop that decides about resuming and stopping the work based on a stop predicate provided as a constructor parameter (simple boolean function) and checked every X seconds. If the predicate is true, it stops. If false, it resumes.
The actual predicate is a boolean function that will do the conjunction of two other boolean function calls. Let's name them tbtc.node.IsWorking() and tbtc.beacon.IsWorking()
Respective IsWorking functions return a boolean answering the question: is the total count of protocol goroutines bigger than 0?
Respective node implementations can expose a helper function triggerProtocol that will run goroutines for respective protocols and do the accounting that will be used by IsWorking. The triggerProtocol function will be invoked for each member of DKG, signing, etc.

This way we don't need to bother about controlling the miner lifecycle by client goroutines. The lifecycle controller will be single-threaded so there will be no concurrency problems as well. Also, applications that don't need a reference to the miner (beacon) will not be forced to manage it. We can also reuse that mechanism to expose metrics if needed.

WDYT?

Clearly express the requirements for goroutines calling Stop() and Resume().

pdyraga · 2022-08-18T09:35:33Z

I think this option is definitely worth considering and experimenting with. Regarding goroutines assumptions, I think we'll have the same as now because it is not enough to say "DKG is happening" - we need to know if the current node participates in that DKG and when the last goroutine finished.

But! This may be simpler in terms of concurrency because we should be able to eliminate the latch. Also, we'll avoid passing miner reference around which I already tried locally and didn't like.

I would do the following: let's review/merge #3177 and #3180 to have the current approach finished in terms of miner and pool (without yet stopping/resuming miner from goroutines) and then I'll start experimenting with the approach you suggested.

Removed deferred call to miner.Stop() given this call is already done in the body of the test and miner.Stop() can not be called twice without resuming the sane number of times.

pdyraga requested a review from lukasz-zimnoch August 17, 2022 12:24

pdyraga added the 📟 client label Aug 17, 2022

pdyraga added this to the v2.0.0-m3 milestone Aug 17, 2022

pdyraga self-assigned this Aug 17, 2022

pdyraga force-pushed the pre-params-5 branch from e2d5ebf to 25468fd Compare August 17, 2022 14:06

pdyraga mentioned this pull request Aug 17, 2022

Updated TSS lib dependency #3177

Merged

Base automatically changed from pre-params-4 to main August 18, 2022 06:27

pdyraga marked this pull request as ready for review August 18, 2022 06:44

lukasz-zimnoch reviewed Aug 18, 2022

View reviewed changes

Comment thread pkg/miner/miner.go

Comment thread pkg/miner/miner_test.go

Comment thread pkg/miner/miner.go

Added description to TestMineStopResumeReentrant

e144844

Improved documentation of miner structure

5b40f86

Clearly express the requirements for goroutines calling Stop() and Resume().

lukasz-zimnoch previously approved these changes Aug 18, 2022

View reviewed changes

lukasz-zimnoch enabled auto-merge August 18, 2022 09:39

Two fixes to pool_test.go: redundant Stop calls

e852f37

Removed deferred call to miner.Stop() given this call is already done in the body of the test and miner.Stop() can not be called twice without resuming the sane number of times.

pdyraga dismissed lukasz-zimnoch’s stale review via e852f37 August 18, 2022 10:24

lukasz-zimnoch approved these changes Aug 18, 2022

View reviewed changes

lukasz-zimnoch merged commit 99f3d8c into main Aug 18, 2022

lukasz-zimnoch deleted the pre-params-5 branch August 18, 2022 10:41

pdyraga removed their assignment Sep 30, 2022

pdyraga modified the milestones: v2.0.0-m3, v2.0.0-m4 Sep 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Block miner computations until everyone agrees to resume#3174

Block miner computations until everyone agrees to resume#3174
lukasz-zimnoch merged 4 commits intomainfrom
pre-params-5

pdyraga commented Aug 17, 2022 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lukasz-zimnoch commented Aug 18, 2022 •

edited

Loading

Uh oh!

pdyraga commented Aug 18, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pdyraga commented Aug 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lukasz-zimnoch commented Aug 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pdyraga commented Aug 18, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pdyraga commented Aug 17, 2022 •

edited

Loading

lukasz-zimnoch commented Aug 18, 2022 •

edited

Loading