[Merged by Bors] - certifier: accept early certify message #3497
Conversation
but that is not relevant if someone disagrees. a cert message is either valid according to your view or invalid, so once there are valid msgs with total weight > 400, what's the point of adding more? i also think that it creates problems for the p2p network, since anybody can broadcast invalid data and every peer will re-broadcast it. the only good option is to ignore certificates that are invalid according to your view.

eligibility can be validated even if the message is early.
the buffer is unnecessary. eligibility can be verified when a message is received and the result saved into a map[Layer][BlockID]; i don't think it is a good option to store all signatures. i also don't understand what's the problem with doing it in a clean way. i would rather output a certificate on one of two events (see the sketch below):

- register is called and we already have a certificate
- otherwise, when the required units are received after register

it is pretty clear that we can't just keep them, as they are not validated when received, so anybody can spam a node with fake certificates until it blows up. i think the syncer should not be asking for certificates until it has downloaded the ballots required for the beacon
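A minimal sketch of that shape, with hypothetical names standing in for go-spacemesh's real certifier types and API: eligibility is validated on receipt, only accumulated weight per (layer, block) is kept rather than raw signatures, and a certificate is emitted on whichever of the two events above happens last.

```go
package certsketch

// Hypothetical stand-ins for types.LayerID / types.BlockID; the real
// certifier types and API differ.
type (
	LayerID uint32
	BlockID [20]byte
)

type tracker struct {
	threshold  uint64                         // e.g. total weight > 400
	weight     map[LayerID]map[BlockID]uint64 // validated weight only; no signatures kept
	registered map[LayerID]BlockID
}

func newTracker(threshold uint64) *tracker {
	return &tracker{
		threshold:  threshold,
		weight:     map[LayerID]map[BlockID]uint64{},
		registered: map[LayerID]BlockID{},
	}
}

// OnMessage is called only after a message's eligibility proof was validated
// against the local view; invalid messages are dropped and never relayed.
// It reports whether a certificate can be emitted now.
func (t *tracker) OnMessage(lid LayerID, bid BlockID, w uint64) bool {
	if t.weight[lid] == nil {
		t.weight[lid] = map[BlockID]uint64{}
	}
	t.weight[lid][bid] += w
	reg, ok := t.registered[lid]
	return ok && reg == bid && t.weight[lid][bid] >= t.threshold
}

// OnRegister records the block this node certifies for the layer; if enough
// validated weight already arrived early, the certificate is ready at once.
func (t *tracker) OnRegister(lid LayerID, bid BlockID) bool {
	t.registered[lid] = bid
	return t.weight[lid][bid] >= t.threshold
}
```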
mesh/mesh.go (Outdated)

```go
// a synced certificate cannot be verified before the node learns the beacon value
// of the epoch. any certificate passed on by the syncer will stay here until
// beacon is available.
certCache map[types.LayerID][]*types.Certificate
```
i don't think that it is ok to add more mutable state randomly across the code. is there a way to refactor this better?

can the syncer download ballots and blocks before downloading opinions on them?
ok. i modified the syncer to fetch layer data and layer opinions separately.
the layer opinions are part of state sync, after the beacon is available.
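Roughly, the split looks like this (a sketch with assumed method names, not the real syncer interface):

```go
package syncsketch

import "context"

// fetcher is a hypothetical interface; the real go-spacemesh syncer differs.
type fetcher interface {
	FetchLayerData(ctx context.Context, layer uint32) error     // ballots + blocks
	FetchLayerOpinions(ctx context.Context, layer uint32) error // certificates
	BeaconAvailable(epoch uint32) bool
}

// syncLayer fetches layer data immediately, but defers opinions until the
// epoch beacon is known, since certificates cannot be verified without it.
func syncLayer(ctx context.Context, f fetcher, layer, layersPerEpoch uint32) error {
	if err := f.FetchLayerData(ctx, layer); err != nil {
		return err
	}
	if !f.BeaconAvailable(layer / layersPerEpoch) {
		return nil // opinions are retried on a later pass
	}
	return f.FetchLayerOpinions(ctx, layer)
}
```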
so during TestAddNodes i found that a new node that joined the network always has a higher total weight than the online nodes, say there are 800 msgs in the network (assuming every miner has eligibility 1)

i think it has no chance to accept a certificate if it is using a different state for the eligibility distribution. the way i understand it: let's say there is some state X (which is prepared based on atxs X1, X2, X3). the eligibility distribution will change completely if node C is using a different set of atxs (let's say state X plus atx Y1). so all eligibilities that were computed using state X will be invalid if they are verified using state Y.
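A toy illustration of that point (not the actual eligibility computation): any digest derived from the atx set changes completely when the set changes, so proofs checked against different views cannot match.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// stateRoot is a toy stand-in for the state an eligibility proof is
// computed against: any change to the atx set changes it completely.
func stateRoot(atxs []string) [32]byte {
	sort.Strings(atxs)
	return sha256.Sum256([]byte(strings.Join(atxs, "|")))
}

func main() {
	x := stateRoot([]string{"X1", "X2", "X3"})       // state X
	y := stateRoot([]string{"X1", "X2", "X3", "Y1"}) // state X plus atx Y1
	fmt.Printf("state X root: %x...\nstate Y root: %x...\n", x[:4], y[:4])
	// an eligibility computed against X will not verify against Y
}
```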
thanks for the explanation. i'll revert to the old behavior.
agreed
note that msgs stored in the buffer ARE validated before being stored. but as you stated above, changing the data structure to map[Layer][BlockID] is cleaner.
bors try

tryBuild failed:

bors try
@dshulyak please take another look. thank you. added the following change

tryBuild succeeded:
activation/activation.go (Outdated)

```go
case <-b.layerClock.AwaitLayer(waitTill):
	// this estimate work if the majority of the nodes use poet servers that are configured the same way.
	// TODO: do better when nodes use different poet services
	timer := time.NewTimer(time.Second * 35)
```
comment doesn't make sense.
are you waiting for the next epoch or 35 seconds after the epoch? what is special about those 35 seconds?
config/presets/testnet.go (Outdated)

```diff
@@ -56,6 +56,9 @@ func testnet() config.Config {
 	conf.POST.MaxNumUnits = 4
 	conf.POST.MinNumUnits = 2
 
+	conf.POET.CycleGap = 10 * time.Second
+	conf.POET.PhaseShift = 20 * time.Second
```
it is better to remove the poet config from this pr completely; it will only cause confusion since it is not used
i was gonna use this in the wait computation and forgot (and used 35 instead...)
mesh/mesh.go (Outdated)

```diff
@@ -410,7 +397,7 @@ func (msh *Mesh) ProcessLayer(ctx context.Context, layerID types.LayerID) error
 
 	if !to.Before(from) {
 		if err := msh.pushLayersToState(ctx, from, to, newVerified); err != nil {
-			logger.With().Error("failed to push layers to state", log.Err(err))
+			logger.With().Warning("failed to push layers to state", log.Err(err))
```
why warning? this is not an expected event under any circumstances
it is expected when a certificate is not available. we cannot find a block to apply in that case.
changed back to error. it is an error for online nodes.
config/presets/fastnet.go (Outdated)

```diff
@@ -50,6 +50,9 @@ func fastnet() config.Config {
 	conf.POST.MaxNumUnits = 4
 	conf.POST.MinNumUnits = 2
 
+	conf.POET.CycleGap = 10 * time.Second
+	conf.POET.PhaseShift = 20 * time.Second
```
they should be set in systest if you want to set them
done
systest/cluster/cluster.go (Outdated)

```diff
-	c.clients = append(c.clients, clients...)
-	c.smeshers = len(clients)
+	c.clients = append(c.clients, smeshers...)
+	c.smeshers = len(smeshers)
```
i think it would be less prone to bugs to avoid append on the smeshers slice, e.g.

```go
c.clients = append(..., bootnodes...)
c.clients = append(..., smeshers...)
c.clients = append(..., clients...)
c.smeshers += len(clients)
```
done
activation/activation.go (Outdated)

```diff
@@ -279,6 +293,8 @@ func (b *Builder) run(ctx context.Context) {
 		return
 	}
 
+	b.waitForFirstATX(ctx)
```
it shouldn't be in the initializer
it's in run(), not the initializer.
activation/activation.go (Outdated)

```go
	}
}
b.log.WithContext(ctx).With().Info("ready to build first atx",
	log.Stringer("current_layer", currentLayer))
```
currentLayer reports the old layer that was computed before waiting
fixed
activation/activation.go (Outdated)

```go
case <-b.layerClock.AwaitLayer(waitTill):
	// this estimate work if the majority of the nodes use poet servers that are configured the same way.
	// TODO: do better when nodes use different poet services
	timer := time.NewTimer(time.Second * 35)
```
i looked up the logs and see why you set 35, but it doesn't make any sense. something is wrong with the time/layer computation in go-spacemesh.

atxs that target the next epoch are published at epoch_start + phase_shift - cycle_gap. so after the first layer of the epoch has started you need to wait phase_shift - cycle_gap + the expected broadcast duration (for fastnet we can set it to 5s).

UPD:
the clock is 1-based, so if genesis is at 03:01 the first layer of the first epoch will be ticked at 03:01. so according to the clock the first layer of the 4th epoch should be layer 17, but the computation in https://github.com/spacemeshos/go-spacemesh/blob/develop/common/types/activation.go#L41 is 0-based, so the first layer of the 4th epoch is layer 16. we will probably have to change the clock to be 0-based, otherwise there will be a mess.
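The off-by-one can be demonstrated with fastnet's layersPerEpoch = 4 (value assumed here):

```go
package main

import "fmt"

const layersPerEpoch = 4 // fastnet value, assumed

// firstLayer mirrors the 0-based math in common/types/activation.go:
// the first layer of epoch e is e*layersPerEpoch.
func firstLayer(epoch uint32) uint32 { return epoch * layersPerEpoch }

func main() {
	fmt.Println(firstLayer(4)) // 16: the 0-based computation
	// the clock, however, ticks layer 1 at genesis, so the tick that fires
	// 16 layer durations after genesis is labeled layer 17.
	fmt.Println(firstLayer(4) + 1) // 17: what the 1-based clock reports
}
```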
yikes. for some reason i didn't update this to use the values set in config. the 35 sec was done for systest testing :(
i will file an issue for the clock for now and set the config values in systest.
you need to wait longer if that clock is not fixed. if you check miner smesher-27 logs with `{namespace="test-piwv", pod="smesher-27-0"} |= "atxBuilder"`, the first atx is still messed up:

```json
{"L":"INFO","T":"2022-08-28T10:07:52.828-0400","N":"9272a.atxBuilder ","M":"new atx challenge is ready","node_id":"9272a5bf98750e7d5009db903e1b872d80077c1b1f7db2870cc24bd2e527ceaa","module":"atxBuilder","sessionId":"b52ebb53-04c3-411b-a903-2d98b06f3f57","target_epoch":"4","current_epoch":"4","name":"atxBuilder"}
```

and then it fixes the atx in the next epoch:

```json
{"L":"INFO","T":"2022-08-28T10:09:16.057-0400","N":"9272a.atxBuilder ","M":"new atx challenge is ready","node_id":"9272a5bf98750e7d5009db903e1b872d80077c1b1f7db2870cc24bd2e527ceaa","module":"atxBuilder","sessionId":"3ee492d6-a97f-4120-9c5d-88ba224181d6","target_epoch":"6","current_epoch":"5","name":"atxBuilder"}
```

to fix this issue either the grace period needs to be adjusted by one more layer duration (so the total will be 20s for fastnet), or better, waitTill should be equal to nextEpoch.FirstLayer().Add(1).
on the other hand, if it doesn't break the test it can be fixed in #3508
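A sketch of the suggested wait, with assumed names; the durations come from config rather than a hardcoded 35s:

```go
package waitsketch

import "time"

// atxBroadcastWait returns how long to wait after the first layer of the
// epoch ticks: atxs targeting the next epoch are published at
// epoch_start + phase_shift - cycle_gap, plus an expected broadcast
// duration (~5s for fastnet). Names here are assumed, not the real API.
func atxBroadcastWait(phaseShift, cycleGap, broadcastGrace time.Duration) time.Duration {
	return phaseShift - cycleGap + broadcastGrace
}
```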
```go
if errors.Is(err, sql.ErrNotFound) {
	return types.Hash32{}, nil
}
return root, err
```
not sure if this is correct. i would rather pass an empty layer to the vm and let it persist an empty (or the previous, if we will use cumulative) hash. but no reason to do it in this pr
added a TODO to rethink this. it currently can cause TestFailedNodes to stall and time out the CI test
bors try

k. i'll work on re-enabling these two tests in follow-up PRs

tryBuild failed:

bors try

tryBuild failed:

bors merge
## Motivation
Closes #3475

## Changes
- block certifier accepts early CertifyMessage
- syncer:
  - separate fetching of layer data and layer opinions (only the certificate for now)
  - only fetch certificates from peers within hdist layers of the processed layer. the processed layer is tortoise's last layer
- fix sql bug so that a certificate is retrieved correctly and not overwritten once set
- systest:
  - fixed TestAddNode so that the test fails on certificate sync failures
  - make TestFailedNodes reassess layer hashes from nodes: because blocks are applied optimistically, the applied block can change after tortoise verifies a layer
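The hdist bound on certificate fetching can be pictured as follows (a sketch, names assumed):

```go
package syncsketch

// requestCert reports whether to ask peers for a layer's certificate:
// only layers within hdist of the processed (tortoise's last) layer qualify.
func requestCert(layer, processed, hdist uint32) bool {
	return layer+hdist >= processed
}
```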
Build failed:

bors merge

Build failed:

bors merge

Build failed:

bors merge

Build failed:
bors try

tryBuild succeeded:
bors merge

Build failed:

bors merge

Pull request successfully merged into develop. Build succeeded: