
[Merged by Bors] - certifier: accept early certify message #3497


Closed
countvonzero wants to merge 15 commits into develop from accept-early-certify-message

Conversation

@countvonzero (Contributor) commented Aug 26, 2022

Motivation

Closes #3475

Changes

  • block certifier accepts early CertifyMessages
  • syncer:
    • separate fetching of layer data and layer opinions (only the certificate for now)
    • only fetch certificates from peers within hdist layers of the processed layer; the processed layer is the tortoise's last layer (see the sketch after this list)
  • fix sql bug so that a certificate is retrieved correctly and not overwritten once set
  • systest
    • fixed systest TestAddNode so that the test fails on certificate sync failures
    • make systest TestFailedNodes reassess layer hashes from nodes: because blocks are applied optimistically, the applied block can change after the tortoise verifies a layer
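
For orientation, here is a minimal sketch of the hdist gating mentioned in the syncer bullet; the function name, the bare uint32 layer numbers, and the exact comparison are illustrative assumptions rather than the actual syncer code:

```go
package syncer

// shouldFetchCert reports whether the syncer should still ask peers for a
// certificate for layer lid: only layers within hdist of the processed
// layer (the tortoise's last layer) are decided by certificates; anything
// older is already decided by the tortoise, so its certificate is skipped.
func shouldFetchCert(lid, processed, hdist uint32) bool {
	if processed <= hdist {
		return true // still within hdist of genesis
	}
	return lid >= processed-hdist
}
```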

@dshulyak (Contributor) commented Aug 26, 2022

> the block certifier tries to wait for all CertifyMessages. the behavior now is a background thread generating certificates / pruning messages every time a layer ticks. this could be reduced to a smaller interval but i don't find it necessary. the reason to wait for all messages is that nodes in the network have different epoch weights: when node A finds a msg valid, node B may disagree. since these certificates are going to be pruned, size is not an issue. therefore we want to make a best effort for as many nodes as possible to generate a certificate, or accept one from peers, by having as many CertifyMessages as possible in each Certificate.

but that is not relevant if someone disagrees. a cert message is either valid according to your view or invalid, so once there are valid msgs with total weight > 400, what's the point of adding more?

i also think that it creates problems for the p2p network, since anybody can broadcast invalid data and every peer will re-broadcast it. the only good option is to ignore certificates that are invalid according to your view

@dshulyak (Contributor) commented

> block certifier accepts early CertifyMessages and puts them in a cache until the node's block is generated.

eligibility can be validated even if message is early

@dshulyak (Contributor) left a comment

  • the buffer is unnecessary. eligibility can be verified when received and saved into map[Layer][BlockID] (see the sketch after this list)
  • i don't think that it is a good option to store all signatures. i also don't understand what's the problem with doing it in a clean way. i would rather output the certificate in one of two events:
    • register is called and we already have a certificate
    • otherwise when the required units were received after register
  • it is pretty clear that we can't just keep them, as they are not validated when received. so anybody can spam the node with fake certificates until it blows up. i think the syncer should not be asking for certificates until it has downloaded the ballots required for the beacon
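
A rough sketch of the shape this review suggests; every type and name below is illustrative (eligibility verification is reduced to a boolean), not the actual go-spacemesh certifier API:

```go
package certifier

import "sync"

type (
	LayerID uint32
	BlockID [20]byte
)

// CertifyMessage carries one certifier's vote for a block; Weight is the
// eligibility weight it contributes toward the certificate threshold.
type CertifyMessage struct {
	Layer  LayerID
	Block  BlockID
	Weight uint64
}

type certifier struct {
	mu        sync.Mutex
	threshold uint64
	// validated weight collected so far, keyed as in the suggested
	// map[Layer][BlockID] shape
	weight map[LayerID]map[BlockID]uint64
}

// HandleMessage verifies eligibility the moment a message arrives (early
// or not), drops messages invalid in the local view instead of buffering
// raw signatures, and reports whether the block has gathered enough
// weight to output a certificate.
func (c *certifier) HandleMessage(m CertifyMessage, eligible bool) bool {
	if !eligible {
		return false // invalid per our view: never stored, never re-gossiped
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.weight[m.Layer] == nil {
		c.weight[m.Layer] = make(map[BlockID]uint64)
	}
	c.weight[m.Layer][m.Block] += m.Weight
	return c.weight[m.Layer][m.Block] >= c.threshold
}
```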

mesh/mesh.go Outdated
// a synced certificate cannot be verified before the node learns the beacon value
// of the epoch. any certificate passed on by the syncer will stay here until
// beacon is available.
certCache map[types.LayerID][]*types.Certificate
Contributor

i don't think that it is ok to add more mutable state randomly across the code. is there a way to refactor it in a better way?

can syncer download ballots and blocks before downloading opinions on them?

Contributor Author

ok. i modified the syncer to fetch layer data and layer opinions separately.
layer opinions are part of state sync, fetched after the beacon is available.

@countvonzero (Contributor Author) commented

> but that is not relevant if someone disagrees. a cert message is either valid according to your view or invalid, so once there are valid msgs with total weight > 400, what's the point of adding more?

so during TestAddNodes i found that a new node that joined the network always has a higher total weight than the online nodes,
because it fetched ATXs from its peers. so it often rejects the certificates it synced, because it finds most of those CertifyMessages invalid. this happens amongst the online nodes too.

say there are 800 msgs in the network (assuming every miner has eligibility 1)
node A collects [0, 400] to create CertificateA
node B collects [3, 403] to create CertificateB
don't you think node C (a new node with 20(?) peers) has a better chance to accept node A/B's certificate if every certificate has all 800 CertifyMessages?
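
The acceptance rule both sides are arguing about, as a sketch (reusing the illustrative CertifyMessage type from the earlier sketch; validLocally stands in for whatever eligibility check the node runs against its own state):

```go
// acceptCertificate accepts a synced certificate iff the messages that
// are valid in the receiving node's own view carry enough total weight.
// packing more CertifyMessages into a certificate can only raise the
// chance that a node with a slightly different view still finds enough
// weight (400 in this discussion) that it agrees with.
func acceptCertificate(msgs []CertifyMessage, validLocally func(CertifyMessage) bool, threshold uint64) bool {
	var total uint64
	for _, m := range msgs {
		if validLocally(m) {
			total += m.Weight
		}
	}
	return total >= threshold
}
```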

@dshulyak (Contributor) commented Aug 26, 2022

> don't you think node C (a new node with 20(?) peers) has a better chance to accept node A/B's certificate if every certificate has all 800 CertifyMessages?

i think it has no chance to accept the certificate if it is using a different state for the eligibility distribution

the way i understand it: let's say there is some state X (which is prepared based on atxs X1, X2, X3).
if both node A and node B are using state X for eligibility - they will both compute the same eligibility distribution for certifiers. that certificate can be composed from different signatures, but it is still the same certificate.

the eligibility distribution will change completely if node C is using a different set of atxs (let's say state X plus atx Y1). so all eligibilities that were computed using state X will be invalid if they are verified using state Y.
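
A toy illustration of this point, not the real eligibility code (ATX ids are plain strings and the seed is a bare sha256): nodes holding exactly {X1, X2, X3} derive identical seeds and hence identical certifier distributions, while a node that also holds Y1 derives an unrelated seed.

```go
package eligibility

import (
	"crypto/sha256"
	"sort"
)

// seedFor hashes a node's view of the active ATX set. the resulting seed
// (and any eligibility distribution derived from it) changes completely
// when even one ATX is added to or removed from the set.
func seedFor(activeSet []string) [32]byte {
	sorted := append([]string(nil), activeSet...)
	sort.Strings(sorted)
	h := sha256.New()
	for _, atx := range sorted {
		h.Write([]byte(atx))
	}
	var seed [32]byte
	copy(seed[:], h.Sum(nil))
	return seed
}
```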

@countvonzero (Contributor Author) commented

> don't you think node C (a new node with 20(?) peers) has a better chance to accept node A/B's certificate if every certificate has all 800 CertifyMessages?

> i think it has no chance to accept the certificate if it is using a different state for the eligibility distribution
>
> the way i understand it: let's say there is some state X (which is prepared based on atxs X1, X2, X3). if both node A and node B are using state X for eligibility - they will both compute the same eligibility distribution for certifiers. that certificate can be composed from different signatures, but it is still the same certificate.
>
> the eligibility distribution will change completely if node C is using a different set of atxs (let's say state X plus atx Y1). so all eligibilities that were computed using state X will be invalid if they are verified using state Y.

thanks for the explanation. i'll revert to the old behavior,
and will file an issue to record the weight differences between nodes and debug/resolve after the feature is complete.

@countvonzero changed the title Accept early certify message → certifier: accept early certify message Aug 26, 2022
@countvonzero (Contributor Author) commented Aug 26, 2022

> the buffer is unnecessary. eligibility can be verified when received and saved into map[Layer][BlockID]

agreed

> it is pretty clear that we can't just keep them, as they are not validated when received. so anybody can spam the node with fake certificates until it blows up. i think the syncer should not be asking for certificates until it has downloaded the ballots required for the beacon

note that msgs ARE validated before being stored in the buffer. but as you stated above, changing the data structure to map[Layer][BlockID] is cleaner

@countvonzero force-pushed the accept-early-certify-message branch from cb7a4f9 to 0611720 on August 27, 2022 01:44
@countvonzero requested a review from fasmat as a code owner August 27, 2022 01:44
@countvonzero force-pushed the accept-early-certify-message branch 2 times, most recently from c3a156b to 97f0f92 on August 27, 2022 23:46
@countvonzero (Contributor Author) commented

bors try

bors bot added a commit that referenced this pull request Aug 28, 2022
@bors bot commented Aug 28, 2022

try

Build failed:

@countvonzero (Contributor Author) commented

bors try

bors bot added a commit that referenced this pull request Aug 28, 2022
@countvonzero (Contributor Author) commented

@dshulyak please take another look. thank you.

added the following changes

  • certifier accepts early messages in the same data structure and generates the certificate as early as possible
  • syncer fetches layer data and opinions separately
  • change TestFailedNodes to get layer hashes multiple times: for the hashes of earlier layers, nodes may change their minds about which block to apply after the tortoise verifies
  • change vm to return an empty hash when no layers were applied. this was causing nodes to get into a crash loop during TestFailedNodes when mesh tried to revert to the last applied layer 8 (which was empty)
  • change newly joined nodes to wait for the next epoch to publish their first ATX

@bors bot commented Aug 28, 2022

try

Build succeeded:

case <-b.layerClock.AwaitLayer(waitTill):
// this estimate works if the majority of the nodes use poet servers that are configured the same way.
// TODO: do better when nodes use different poet services
timer := time.NewTimer(time.Second * 35)
Contributor

the comment doesn't make sense.
are you waiting for the next epoch, or for 35 seconds after the epoch starts? what is special about those 35 seconds?

@@ -56,6 +56,9 @@ func testnet() config.Config {
conf.POST.MaxNumUnits = 4
conf.POST.MinNumUnits = 2

conf.POET.CycleGap = 10 * time.Second
conf.POET.PhaseShift = 20 * time.Second
Contributor

it is better to remove poet config from this pr completely, it will only cause confusion since it is not used

Contributor Author

i was gonna use this in the wait computation and forgot (and used 35 instead...)

mesh/mesh.go Outdated
@@ -410,7 +397,7 @@ func (msh *Mesh) ProcessLayer(ctx context.Context, layerID types.LayerID) error

if !to.Before(from) {
if err := msh.pushLayersToState(ctx, from, to, newVerified); err != nil {
logger.With().Error("failed to push layers to state", log.Err(err))
logger.With().Warning("failed to push layers to state", log.Err(err))
Contributor

why warning? this is not an expected event under any circumstances

Contributor Author

it is expected when a certificate is not available. we cannot find a block to apply in that case.

Contributor Author

changed back to error. it is an error for online nodes.

@@ -50,6 +50,9 @@ func fastnet() config.Config {
conf.POST.MaxNumUnits = 4
conf.POST.MinNumUnits = 2

conf.POET.CycleGap = 10 * time.Second
conf.POET.PhaseShift = 20 * time.Second
Contributor

they should be set in systest if you want to set them

Contributor Author

done

c.clients = append(c.clients, clients...)
c.smeshers = len(clients)
c.clients = append(c.clients, smeshers...)
c.smeshers = len(smeshers)
Contributor

i think it would be less prone to bugs to avoid append on the smeshers slice, e.g.

c.clients = append(..., bootnodes...)
c.clients = append(..., smeshers...)
c.clients = append(...., clients...)
c.smeshers += len(clients)

Contributor Author

done

@@ -279,6 +293,8 @@ func (b *Builder) run(ctx context.Context) {
return
}

b.waitForFirstATX(ctx)
Contributor

it shouldn't be in initializer

Contributor Author

it's in run(). not the initializer

}
}
b.log.WithContext(ctx).With().Info("ready to build first atx",
log.Stringer("current_layer", currentLayer))
Contributor

currentLayer reports an old layer, the one that was computed before waiting

Contributor Author

fixed

case <-b.layerClock.AwaitLayer(waitTill):
// this estimate works if the majority of the nodes use poet servers that are configured the same way.
// TODO: do better when nodes use different poet services
timer := time.NewTimer(time.Second * 35)
@dshulyak (Contributor) commented Aug 28, 2022

i looked up the logs and see why you set 35, but it doesn't make any sense. something is wrong with the time/layer computation in go-spacemesh

atxs that target the next epoch are published at epoch_start + phase_shift - cycle_gap. so after the first layer of the epoch has started you need to wait phase_shift - cycle_gap + the expected broadcast duration (for fastnet we can set it to 5s)

UPD:
the clock is 1-based, so if genesis is at 03:01 - the first layer of the first epoch will be ticked at 03:01
so according to the clock the first layer of the 4th epoch should be layer 17
but the computation in https://github.com/spacemeshos/go-spacemesh/blob/develop/common/types/activation.go#L41
is 0-based, so the first layer of the 4th epoch is layer 16

we will probably have to change the clock to be 0-based, otherwise there will be a mess
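
Putting the formula from this comment into code, as a sketch under stated assumptions (PoetConfig and atxWait are made-up names; only the arithmetic comes from this comment and the fastnet values set in this PR):

```go
package activation

import "time"

// PoetConfig mirrors the two poet durations discussed in this thread.
type PoetConfig struct {
	PhaseShift time.Duration
	CycleGap   time.Duration
}

// atxWait is how long to wait after the first layer of the epoch ticks:
// ATXs targeting the next epoch appear at epoch_start + phase_shift -
// cycle_gap, plus an expected broadcast duration. with PhaseShift=20s and
// CycleGap=10s from this PR and a 5s grace this yields 15s; adding one
// more 5s fastnet layer for the 1-based clock off-by-one gives the 20s
// total mentioned below.
func atxWait(cfg PoetConfig, broadcastGrace time.Duration) time.Duration {
	return cfg.PhaseShift - cfg.CycleGap + broadcastGrace
}
```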

Contributor Author

yikes. for some reason i didn't update this to use the values set in config. the 35 sec was done for systest testing :(

Contributor Author

i will file an issue for the clock for now and set the config values in systest.

@dshulyak (Contributor) commented Aug 28, 2022

you need to wait longer if that clock is not fixed. if you check miner smesher-27 logs {namespace="test-piwv", pod="smesher-27-0"} |= "atxBuilder" - the first atx is still messed up:

{"L":"INFO","T":"2022-08-28T10:07:52.828-0400","N":"9272a.atxBuilder   ","M":"new atx challenge is ready","node_id":"9272a5bf98750e7d5009db903e1b872d80077c1b1f7db2870cc24bd2e527ceaa","module":"atxBuilder","sessionId":"b52ebb53-04c3-411b-a903-2d98b06f3f57","target_epoch":"4","current_epoch":"4","name":"atxBuilder"}

and then it fixes atx in the next epoch:

{"L":"INFO","T":"2022-08-28T10:09:16.057-0400","N":"9272a.atxBuilder   ","M":"new atx challenge is ready","node_id":"9272a5bf98750e7d5009db903e1b872d80077c1b1f7db2870cc24bd2e527ceaa","module":"atxBuilder","sessionId":"3ee492d6-a97f-4120-9c5d-88ba224181d6","target_epoch":"6","current_epoch":"5","name":"atxBuilder"}

to fix this issue either the grace period needs to be adjusted by one more layer duration (so the total will be 20s for fastnet), or, better, waitTill should be equal to nextEpoch.FirstLayer().Add(1)


on the other hand if it doesn't break the test - it can be fixed in #3508

if errors.Is(err, sql.ErrNotFound) {
return types.Hash32{}, nil
}
return root, err
Contributor

not sure if this is correct. i would rather pass the empty layer to vm and let it persist an empty (or the previous, if we will use cumulative) hash. but no reason to do it in this pr

Contributor Author

added a todo to rethink this. it currently can cause TestFailedNodes to stall and time out the CI test

@countvonzero (Contributor Author) commented

bors try

@countvonzero (Contributor Author) commented

> the code stops watchLayers() when confirmed >= stopSkew, but then requires that stopSkew <= confirmed. doesn't that essentially mean that it requires confirmed == stopSkew?

> i think it was in case layers are not confirmed one by one, so the test is fine as long as confirmed advanced at least to the stopSkew level before the stopTest layer is reached.

k. i'll work on re-enabling these two tests in followup PRs
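
A sketch of the loop condition in question, using names from the conversation (watchLayers, stopSkew) but none of the real systest code:

```go
package systest

// watchLayers consumes confirmed-layer updates and stops once confirmed
// reaches stopSkew. since several layers can be confirmed in one update,
// the value returned may overshoot stopSkew, which is why the final
// assertion should be stopSkew <= confirmed rather than equality.
func watchLayers(confirmedUpdates <-chan uint32, stopSkew uint32) uint32 {
	var confirmed uint32
	for c := range confirmedUpdates {
		if c > confirmed {
			confirmed = c
		}
		if confirmed >= stopSkew {
			break
		}
	}
	return confirmed
}
```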

@bors bot commented Aug 29, 2022

try

Build failed:

@countvonzero (Contributor Author) commented

bors try

bors bot added a commit that referenced this pull request Aug 29, 2022
@bors bot commented Aug 29, 2022

try

Build failed:

@countvonzero (Contributor Author) commented

bors merge

bors bot pushed a commit that referenced this pull request Aug 29, 2022
@bors bot commented Aug 29, 2022

Build failed:

@countvonzero (Contributor Author) commented

bors merge

bors bot pushed a commit that referenced this pull request Aug 29, 2022
@bors bot commented Aug 29, 2022

Build failed:

@countvonzero (Contributor Author) commented

bors merge

bors bot pushed a commit that referenced this pull request Aug 29, 2022
@bors bot commented Aug 29, 2022

Build failed:

@countvonzero (Contributor Author) commented

bors merge

bors bot pushed a commit that referenced this pull request Aug 29, 2022
@bors bot commented Aug 29, 2022

Build failed:

@countvonzero (Contributor Author) commented

bors try

bors bot added a commit that referenced this pull request Aug 29, 2022
@bors bot commented Aug 29, 2022

try

Build succeeded:

@countvonzero (Contributor Author) commented

bors merge

bors bot pushed a commit that referenced this pull request Aug 29, 2022
@bors bot commented Aug 29, 2022

Build failed:

@countvonzero (Contributor Author) commented

bors merge

bors bot pushed a commit that referenced this pull request Aug 29, 2022
@bors bot commented Aug 29, 2022

Pull request successfully merged into develop.

Build succeeded:

@bors bot changed the title certifier: accept early certify message → [Merged by Bors] - certifier: accept early certify message Aug 29, 2022
@bors bot closed this Aug 29, 2022
@bors bot deleted the accept-early-certify-message branch August 29, 2022 21:26