[Merged by Bors] - hare active set: bootstrap and use first consensus block #4241

Closed
wants to merge 15 commits into from

countvonzero
Contributor

@countvonzero countvonzero commented Apr 2, 2023

Motivation

part of #4089

Changes

  • hare

    • bootstrap/fallback: whenever an update is received for epoch N, use it for epoch N's active set
    • steady state: use the first hare output in epoch N as the active set for epoch N (ActiveSet_N)
    • confidence param:
      • the bootstrap active set (targeting epoch 2) is used for layers in [FirstLayer_2, FirstLayer_3+confidence_param)
      • ActiveSet_N is used for layers in [FirstLayer_N+confidence_param, FirstLayer_(N+1)+confidence_param)
  • beacon:

    • whenever an update is received for an epoch, always use it as the beacon for the epoch
  • add debug RPC to query the hare active set used by a node for a given epoch

  • add a new systest TestFallback where the bootstrapper updates continuously, and check that the nodes use the updated values as beacon/active set instead of the ones calculated locally.

  • add metrics to the update listener for query and update outcomes.

  • add metrics to the block certifier to count the number of certificates generated/synced for each epoch.
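
The layer ranges in the confidence-param bullets can be sketched as a small lookup. This is a minimal illustration, not the code from this PR: `layersPerEpoch`, `confidenceParam`, `firstLayer` and `activeSetEpoch` are hypothetical names, the constant values are made up, and the sketch assumes FirstLayer_N = N * layersPerEpoch.

```go
package main

import "fmt"

// Hypothetical parameters for illustration only; real values come from node config.
const (
	layersPerEpoch  = 10
	confidenceParam = 3
	bootstrapEpoch  = 2 // epoch targeted by the bootstrap active set
)

// firstLayer returns the first layer of an epoch (assumed to be epoch * layersPerEpoch).
func firstLayer(epoch uint32) uint32 { return epoch * layersPerEpoch }

// activeSetEpoch returns the epoch whose active set would be used at the given layer.
// ok is false for layers before FirstLayer_2, which the description does not cover.
func activeSetEpoch(layer uint32) (epoch uint32, ok bool) {
	if layer < firstLayer(bootstrapEpoch) {
		return 0, false
	}
	epoch = layer / layersPerEpoch
	// Until confidence_param layers of the epoch have passed,
	// keep using the previous epoch's active set.
	if layer < firstLayer(epoch)+confidenceParam {
		epoch--
	}
	// The bootstrap set covers [FirstLayer_2, FirstLayer_3+confidence_param).
	if epoch < bootstrapEpoch {
		epoch = bootstrapEpoch
	}
	return epoch, true
}

func main() {
	for _, layer := range []uint32{20, 25, 32, 33, 42, 43} {
		epoch, _ := activeSetEpoch(layer)
		fmt.Printf("layer %d uses ActiveSet_%d\n", layer, epoch)
	}
}
```

With these made-up values, layer 32 still maps to the bootstrap set (epoch 2), and layer 33 is the first layer that uses ActiveSet_3.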

@countvonzero
Contributor Author

bors try

bors bot added a commit that referenced this pull request Apr 2, 2023
@countvonzero countvonzero mentioned this pull request Apr 2, 2023
3 tasks
@codecov

codecov bot commented Apr 2, 2023

Codecov Report

Merging #4241 (01d71a6) into develop (924f1f4) will decrease coverage by 0.1%.
The diff coverage is 75.5%.

@@            Coverage Diff            @@
##           develop   #4241     +/-   ##
=========================================
- Coverage     76.1%   76.1%   -0.1%     
=========================================
  Files          239     240      +1     
  Lines        25026   25086     +60     
=========================================
+ Hits         19059   19098     +39     
- Misses        4753    4764     +11     
- Partials      1214    1224     +10     
Impacted Files Coverage Δ
common/types/layer.go 81.0% <ø> (ø)
sql/blocks/blocks.go 73.8% <50.0%> (ø)
blocks/metrics.go 65.2% <65.2%> (ø)
beacon/beacon.go 77.5% <66.6%> (-0.1%) ⬇️
bootstrap/updater.go 70.2% <71.4%> (-0.4%) ⬇️
sql/ballots/ballots.go 70.4% <72.0%> (-0.3%) ⬇️
sql/beacons/beacons.go 83.7% <72.7%> (-4.7%) ⬇️
hare/eligibility/oracle.go 64.4% <74.3%> (-1.3%) ⬇️
api/grpcserver/debug_service.go 67.4% <75.0%> (+0.7%) ⬆️
sql/certificates/certs.go 76.1% <85.0%> (+1.9%) ⬆️
... and 4 more

... and 3 files with indirect coverage changes

@bors

bors bot commented Apr 2, 2023

try

Build failed:

@countvonzero
Contributor Author

bors try

bors bot added a commit that referenced this pull request Apr 2, 2023
@bors

bors bot commented Apr 3, 2023

try

Build failed:

@countvonzero
Contributor Author

bors try

bors bot added a commit that referenced this pull request Apr 3, 2023
@bors

bors bot commented Apr 3, 2023

try

Build succeeded:

@bors

bors bot commented Apr 3, 2023

try

Timed out.

@countvonzero
Contributor Author

bors try

bors bot added a commit that referenced this pull request Apr 12, 2023
@bors

bors bot commented Apr 12, 2023

try

Build failed:

@countvonzero
Contributor Author

bors try

bors bot added a commit that referenced this pull request Apr 13, 2023
@bors

bors bot commented Apr 13, 2023

try

Build succeeded:

@countvonzero
Contributor Author

bors try

bors bot added a commit that referenced this pull request Apr 13, 2023
@bors

bors bot commented Apr 13, 2023

try

Build failed:

@dshulyak
Contributor

dshulyak commented Apr 13, 2023

i thought that we want a block in consensus (e.g. tortoise consensus); of course hare is a prerequisite for that, but there is a difference. if tortoise changes its opinion, what do we want to have as the active set?

if we want tortoise output, then the implementation should be a bit different. it is not ok to pick the first one and cache it forever; instead it makes sense to re-pick it if tortoise changed its opinion. i also thought that we want to monitor actual empty layers in the consensus output. so if a layer was non-empty and then became empty, what should be done in that case?

epoch := block.LayerIndex.GetEpoch()
s.blockCount[epoch]++
delete(s.blockCount, epoch-2)
}
Contributor

@dshulyak dshulyak Apr 13, 2023

i think this metric might be useful for this component, but in general it has some limitations. a node can download and apply a block, bypassing this generator, and it should be good enough for fallback monitoring purposes

Contributor Author

i've updated the metrics to track block certificates instead. after all, that's what we are looking for.

@countvonzero
Contributor Author

i thought that we want a block in consensus (e.g. tortoise consensus); of course hare is a prerequisite for that, but there is a difference. if tortoise changes its opinion, what do we want to have as the active set?

if we want tortoise output, then the implementation should be a bit different. it is not ok to pick the first one and cache it forever; instead it makes sense to re-pick it if tortoise changed its opinion. i also thought that we want to monitor actual empty layers in the consensus output. so if a layer was non-empty and then became empty, what should be done in that case?

@dshulyak the original discussion was to piggyback on hare consensus. the argument is that if there is a block certificate, the network agreed. it's not so much about which block is correct for that layer, but that everyone agreed that hare output that block at that time, since all we need is an active set that nodes can agree on.

the code path for finding a tortoise-verified block is a little more complicated and requires consideration of reverts, as you stated.

Contributor

@dshulyak dshulyak left a comment

i guess it makes sense if it is not going to be a long-term solution

func (d DebugService) ActiveSet(ctx context.Context, req *pb.ActiveSetRequest) (*pb.ActiveSetResponse, error) {
actives, err := d.oracle.ActiveSet(ctx, types.EpochID(req.Epoch))
if err != nil {
log.With().Error("failed to get active set", log.Uint32("epoch", req.Epoch), log.Err(err))
Contributor

i'd prefer to return the details in status.Errorf and drop this

beacon/beacon.go Outdated
pd.logger.With().Info("received beacon update", epoch, beacon)
pd.mu.Lock()
defer pd.mu.Unlock()
if err := beacons.AddOverwrite(pd.cdb, epoch, beacon); err != nil {
Contributor

maybe Set is a better name

defer pd.mu.Unlock()
if err := beacons.AddOverwrite(pd.cdb, epoch, beacon); err != nil {
pd.logger.With().Error("failed to persist fallback beacon", epoch, beacon, log.Err(err))
}
Contributor

this error handling is not sound. i suggest returning an error and failing the node if it is not possible to persist something (or retrying until it is possible to persist it)

it is definitely not an option to just continue

Contributor Author

changed to fail the node.
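
The two options raised in this thread (retry until the write succeeds, or return the error so the caller can stop the node) can be combined in a small helper. This is a sketch under stated assumptions, not the change made in this PR: `persistWithRetry`, `errBusy` and the retry parameters are hypothetical, and the `persist` callback stands in for a database write such as storing a fallback beacon.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBusy is a stand-in failure used only by this example.
var errBusy = errors.New("db busy")

// persistWithRetry calls persist until it succeeds or the attempts run out.
// It returns the last error wrapped, so the caller can fail instead of continuing.
func persistWithRetry(persist func() error, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = persist(); err == nil {
			return nil
		}
		// Wait before the next attempt, but not after the last one.
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("persist failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// flaky fails twice and then succeeds, to exercise the retry path.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errBusy
		}
		return nil
	}
	if err := persistWithRetry(flaky, 5, time.Millisecond); err != nil {
		// In a node, this is where the caller would stop rather than carry on.
		panic(err)
	}
	fmt.Println("persisted after", calls, "calls")
}
```

Wrapping with `%w` keeps the underlying error available to `errors.Is`, which lets the caller decide whether the failure is fatal.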

@@ -60,6 +63,29 @@ func defaultConfig() Config {
}
}

type stats struct {
lock sync.Mutex
Contributor

mu is more common

Contributor Author

thanks. noted. moved the code to certifier.go and reused the mutex there.

func (s *stats) Get() map[types.EpochID]int {
s.lock.Lock()
defer s.lock.Unlock()
result := make(map[types.EpochID]int)
Contributor

it is better to implement such methods without allocating memory on the heap
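
One possible reading of this suggestion is to expose a per-epoch lookup or a callback-based iterator instead of copying the counters into a fresh map on every call. The sketch below is an assumption about what such an API could look like, not the code that was merged: the `EpochID` type is a local stand-in for `types.EpochID`, and `Iterate` is a hypothetical method.

```go
package main

import (
	"fmt"
	"sync"
)

// EpochID is a local stand-in for types.EpochID in the snippet under review.
type EpochID uint32

type stats struct {
	mu     sync.Mutex
	counts map[EpochID]int
}

// Get returns the count for one epoch without building a new map.
func (s *stats) Get(epoch EpochID) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[epoch]
}

// Iterate calls fn for every epoch while holding the lock.
// fn must not call back into s, or it will deadlock.
func (s *stats) Iterate(fn func(epoch EpochID, count int)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for epoch, count := range s.counts {
		fn(epoch, count)
	}
}

func main() {
	s := &stats{counts: map[EpochID]int{3: 2, 4: 5}}
	total := 0
	s.Iterate(func(_ EpochID, count int) { total += count })
	fmt.Println("epoch 3:", s.Get(3), "total:", total)
}
```

The trade-off is that the callback runs under the mutex, so it should stay short; the map-copy version avoids that at the cost of an allocation per call.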

namespace,
"number of queries made to the update site",
[]string{"outcome"},
)
Contributor

maybe track latency

rows int
)
if rows, err = db.Exec(`
select block from certificates where layer between ?1 and ?2 and valid = 1
Contributor

do we keep track of invalid certificates? that's a waste

Contributor Author

we don't keep track of certificates that were invalid in the first place.
we only keep a certificate that was valid before but later became invalid.

  1. a hare output that has no certificate to back it up (this db entry will have a null certificate)
  2. a certificate that was valid when received but is no longer valid once a 2nd valid certificate is received.
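
To make the role of the `valid` flag concrete, here is an in-memory sketch of the filter in the query above (`layer between ?1 and ?2 and valid = 1`), extended to pick the earliest matching layer as in the "first hare output in epoch N" rule from the PR description. The `certRecord` type and `firstValidBlock` function are hypothetical, and how rows with a null certificate are flagged is an assumption; the example simply treats them as not valid.

```go
package main

import "fmt"

// certRecord is a hypothetical in-memory view of one row in the certificates table.
type certRecord struct {
	Layer uint32
	Block string
	Valid bool // assumed false for invalidated certificates and for rows with a null certificate
}

// firstValidBlock returns the block of the earliest valid certificate
// whose layer falls in [from, to], mirroring the SQL filter.
func firstValidBlock(records []certRecord, from, to uint32) (string, bool) {
	var (
		best  certRecord
		found bool
	)
	for _, r := range records {
		if !r.Valid || r.Layer < from || r.Layer > to {
			continue
		}
		if !found || r.Layer < best.Layer {
			best, found = r, true
		}
	}
	return best.Block, found
}

func main() {
	records := []certRecord{
		{Layer: 30, Block: "b30", Valid: false},     // e.g. hare output without a certificate
		{Layer: 31, Block: "b31-old", Valid: false}, // superseded by a later valid certificate
		{Layer: 31, Block: "b31", Valid: true},
		{Layer: 34, Block: "b34", Valid: true},
	}
	block, ok := firstValidBlock(records, 30, 39)
	fmt.Println(block, ok)
}
```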

@countvonzero
Contributor Author

bors try

bors bot added a commit that referenced this pull request Apr 16, 2023
@bors
Copy link

bors bot commented Apr 16, 2023

try

Build succeeded:

@countvonzero
Contributor Author

bors merge

bors bot pushed a commit that referenced this pull request Apr 16, 2023
@bors

bors bot commented Apr 16, 2023

Pull request successfully merged into develop.

Build succeeded:

@bors bors bot changed the title from "hare active set: bootstrap and use first consensus block" to "[Merged by Bors] - hare active set: bootstrap and use first consensus block" Apr 16, 2023
@bors bors bot closed this Apr 16, 2023