Skip to content

Block miner computations until everyone agrees to resume#3174

Merged
lukasz-zimnoch merged 4 commits intomainfrom
pre-params-5
Aug 18, 2022
Merged

Block miner computations until everyone agrees to resume#3174
lukasz-zimnoch merged 4 commits intomainfrom
pre-params-5

Conversation

@pdyraga
Copy link
Copy Markdown
Member

@pdyraga pdyraga commented Aug 17, 2022

Refs #3161
Depends on #3172

From now on, Miner has an internal latch counting how many times Stop() function has been called. Miner expects Resume() function to be called the same number of times. The reasoning behind it is that multiple goroutines can call Stop() and we do not want the computations to unblock until the last goroutine is done.

For example:

  1. Relay entry signing starts, computations are stopped.
  2. ECDSA DKG starts, computations are already stopped.
  3. Relay entry signing completes. Relay entry goroutine asks the computations to resume.

With the change in this PR we want to resume the computations only when all goroutines that requested to stop them have signalled that they completed their work and computations can be resumed.

The signalled element is crucial. We really need all of them to tell the miner that the computations can be resumed. We do not know when the operations that are supposed to stop the computations are invoked and how long they are going to take. I was playing with an idea of a timeout but that turned out to be complicated and error-prone.

Since we need to trust goroutines to call Resume, it means we assume goroutines are working as expected. I am not totally in love with this assumption but I could not figure out anything else if we do not want the generator to just keep running in the background.

If we assume goroutines are working as expected and always call Resume after Stop, it also means we assume they are not going to call Resume twice because it breaks the assumption of goroutines being correctly implemented.

In the current setup, the lock is acquired only to actually stop or resume goroutines and after the lock is acquired we check the state. Stop and Resume are not atomic but I think everything should work fine under the assumption goroutines are:

  1. Always calling Stop() before Resume()
  2. They never forget to call Resume() if they called Stop().

I am open to all suggestions on how to improve this mechanism. I could come up only with this solution if we want to stop the miner. I was also thinking about abandoning this path completely and reworking the pre-params generator code in tss-lib to always work in a single thread, even if generating pre-params will take ages but I don't think it's a better option. 4 years ago, when we were working on the safe prime generator and were optimizing the code pretty heavily, we decided to go with concurrency (benchmarks). What appeals to me about the current approach is that in the real world, we'll not have thousands of goroutines on a single client. Just a few executing DKG or relay entry generation and the call to the miner is quite simple: miner.Stop(); defer miner.Resume().

@pdyraga pdyraga requested a review from lukasz-zimnoch August 17, 2022 12:24
@pdyraga pdyraga added this to the v2.0.0-m3 milestone Aug 17, 2022
@pdyraga pdyraga self-assigned this Aug 17, 2022
From now on, Miner has an internal latch counting how many times `Stop()`
function has been called. Miner expects `Resume()` function to be called the
same number of times. The reasoning behind it is that multiple goroutines can
call `Stop()` and we do not want the computations to unblock until the last
goroutine is done.

For example:
1. Relay entry signing starts, computations are stopped.
2. ECDSA DKG starts, computations are already stopped.
3. Relay entry signing completes. Relay entry goroutine asks the computations
   to resume.

Computations should not resume until ECDSA DKG goroutine asks to resume them.
This approach requires really good hygiene with calls to miner - never forget
to resume if you are stopping.
Base automatically changed from pre-params-4 to main August 18, 2022 06:27
@pdyraga pdyraga marked this pull request as ready for review August 18, 2022 06:44
Comment thread pkg/miner/miner.go
Comment thread pkg/miner/miner_test.go
Comment thread pkg/miner/miner.go
@lukasz-zimnoch
Copy link
Copy Markdown
Member

lukasz-zimnoch commented Aug 18, 2022

Actually I think we have the following alternative to the presented approach:

  1. The miner has an internal control loop that decides about resuming and stopping the work based on a stop predicate provided as a constructor parameter (simple boolean function) and checked every X seconds. If the predicate is true, it stops. If false, it resumes.
  2. The actual predicate is a boolean function that will do the conjunction of two other boolean function calls. Let's name them tbtc.node.IsWorking() and tbtc.beacon.IsWorking()
  3. Respective IsWorking functions return a boolean answering the question: is the total count of protocol goroutines bigger than 0?
  4. Respective node implementations can expose a helper function triggerProtocol that will run goroutines for respective protocols and do the accounting that will be used by IsWorking. The triggerProtocol function will be invoked for each member of DKG, signing, etc.

This way we don't need to bother about controlling the miner lifecycle by client goroutines. The lifecycle controller will be single-threaded so there will be no concurrency problems as well. Also, applications that don't need a reference to the miner (beacon) will not be forced to manage it. We can also reuse that mechanism to expose metrics if needed.

WDYT?

Clearly express the requirements for goroutines calling Stop() and
Resume().
@pdyraga
Copy link
Copy Markdown
Member Author

pdyraga commented Aug 18, 2022

I think this option is definitely worth considering and experimenting with. Regarding goroutines assumptions, I think we'll have the same as now because it is not enough to say "DKG is happening" - we need to know if the current node participates in that DKG and when the last goroutine finished.

But! This may be simpler in terms of concurrency because we should be able to eliminate the latch. Also, we'll avoid passing miner reference around which I already tried locally and didn't like.

I would do the following: let's review/merge #3177 and #3180 to have the current approach finished in terms of miner and pool (without yet stopping/resuming miner from goroutines) and then I'll start experimenting with the approach you suggested.

lukasz-zimnoch
lukasz-zimnoch previously approved these changes Aug 18, 2022
Removed deferred call to miner.Stop() given this call is already done in
the body of the test and miner.Stop() can not be called twice without
resuming the sane number of times.
@lukasz-zimnoch lukasz-zimnoch merged commit 99f3d8c into main Aug 18, 2022
@lukasz-zimnoch lukasz-zimnoch deleted the pre-params-5 branch August 18, 2022 10:41
@pdyraga pdyraga removed their assignment Sep 30, 2022
@pdyraga pdyraga modified the milestones: v2.0.0-m3, v2.0.0-m4 Sep 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants