-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Merged by Bors] - hare active set: bootstrap and use first consensus block #4241
Conversation
bors try |
Codecov Report
@@ Coverage Diff @@
## develop #4241 +/- ##
=========================================
- Coverage 76.1% 76.1% -0.1%
=========================================
Files 239 240 +1
Lines 25026 25086 +60
=========================================
+ Hits 19059 19098 +39
- Misses 4753 4764 +11
- Partials 1214 1224 +10
... and 3 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
tryBuild failed: |
bors try |
tryBuild failed: |
bors try |
tryBuild succeeded: |
tryTimed out. |
b147cac
to
747d472
Compare
bors try |
tryBuild failed: |
bors try |
tryBuild succeeded: |
bors try |
tryBuild failed: |
i thought that we want block in consensus (e.g tortoise consensus), ofcourse hare is a prerequisite for that. but there is a difference. if tortoise changes its opinion, what do we want to have as an active set? if we want tortoise output, then implementation should be a bit different. it is not ok to pick first and cache it forever, instead it makes sense re-pick it if tortoise changed opinion. i also thought that we want to monitor actual empty layers in consensus output. so if it was non-empty and then became empty.. what should be done in such case? |
blocks/generator.go
Outdated
epoch := block.LayerIndex.GetEpoch() | ||
s.blockCount[epoch]++ | ||
delete(s.blockCount, epoch-2) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this metric might be useful for this component, but in general it has some limitations. node can download and apply block bypassing this generator and it should be good enough for fallback monitoring purposes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've updated the metrics to track block certificate instead. after all that's what we are looking for.
@dshulyak the original discussion is to piggyback on hare consensus. the argument is that if there is a block certificate, that means the network agreed. it's not so much about which block is correct for that layer, but that everyone agreed that hare output that block at that time, since all we need is an active set nodes can agree on. the code path for finding a tortoise verified block is a little more complicated and require consideration for revert as you stated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess it makes sense if it is not going to be a long term solution
api/grpcserver/debug_service.go
Outdated
func (d DebugService) ActiveSet(ctx context.Context, req *pb.ActiveSetRequest) (*pb.ActiveSetResponse, error) { | ||
actives, err := d.oracle.ActiveSet(ctx, types.EpochID(req.Epoch)) | ||
if err != nil { | ||
log.With().Error("failed to get active set", log.Uint32("epoch", req.Epoch), log.Err(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i prefer return details in status.Errorf and drop this
beacon/beacon.go
Outdated
pd.logger.With().Info("received beacon update", epoch, beacon) | ||
pd.mu.Lock() | ||
defer pd.mu.Unlock() | ||
if err := beacons.AddOverwrite(pd.cdb, epoch, beacon); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe Set is a better name
defer pd.mu.Unlock() | ||
if err := beacons.AddOverwrite(pd.cdb, epoch, beacon); err != nil { | ||
pd.logger.With().Error("failed to persist fallback beacon", epoch, beacon, log.Err(err)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this error handling is not sound. i suggest to return an error and fail the node if it is not possible to persist something (or retry, until it is possible to persist it)
it is definitely not an option just to continue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to fail the node.
blocks/generator.go
Outdated
@@ -60,6 +63,29 @@ func defaultConfig() Config { | |||
} | |||
} | |||
|
|||
type stats struct { | |||
lock sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mu is more common
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. noted. moved the code to certifier.go and reuse the mutex there.
blocks/generator.go
Outdated
func (s *stats) Get() map[types.EpochID]int { | ||
s.lock.Lock() | ||
defer s.lock.Unlock() | ||
result := make(map[types.EpochID]int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is better to implement such methods without allocating memory on heap
namespace, | ||
"number of queries made to the update site", | ||
[]string{"outcome"}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe track latency
sql/certificates/certs.go
Outdated
rows int | ||
) | ||
if rows, err = db.Exec(` | ||
select block from certificates where layer between ?1 and ?2 and valid = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we keep track of invalid certificates? thats a waste
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't keep track of those that were invalid in the first place.
we only keep certificate that was valid before but later became invalid.
- hare output but has no certificate to back it up (this db entry will have certificate as null)
- a certificate that was valid when received but no longer valid when the 2nd valid certificate is received.
bors try |
tryBuild succeeded: |
bors merge |
## Motivation part of #4089 ## Changes - hare - bootstrap/fallback: whenever an update is received for epoch N, use it for epoch N's active set - steady state: use the 1st hare output in epoch N as active set for epoch N (`ActiveSet_N`) - confidence param: - the bootstrap active set (targeting for epoch 2), is used btwn [`FirstLayer_2`, `FirstLayer_3`+`confidence_param`) - `ActiveSet_N` will be used btwn [`FirstLayer_N`+`confidence_param`, `FirstLayer_N+1`+`confidence_param`) - beacon: - whenever an update is received for an epoch, always use it as the beacon for the epoch - add debug RPC to query the hare active set used by a node for a given epoch - add a new systest TestFallback where bootstrapper will update continuously, and check the nodes use the updated value as beacon/activeset instead of the one calculated locally. - add metrics for update listener for query and update outcome. - add metrics for block certifier to count number of certificate generated/synced for each epoch.
Pull request successfully merged into develop. Build succeeded: |
Motivation
part of #4089
Changes
hare
ActiveSet_N
)FirstLayer_2
,FirstLayer_3
+confidence_param
)ActiveSet_N
will be used btwn [FirstLayer_N
+confidence_param
,FirstLayer_N+1
+confidence_param
)beacon:
add debug RPC to query the hare active set used by a node for a given epoch
add a new systest TestFallback where bootstrapper will update continuously, and check the nodes use the updated value as beacon/activeset instead of the one calculated locally.
add metrics for update listener for query and update outcome.
add metrics for block certifier to count number of certificate generated/synced for each epoch.