hare loads incorrect (tortoise) active set during sync #4552
for example
but applied layer is very far from that
also every time when consensus is stuck i see a log like
@countvonzero i think that's because fork finder blocks processing layers, and we continue to download certificates/ballots/etc in the meantime
we probably should disable fork finder when a peer is doing initial sync. the current behavior fails the hare oracle for the rest of the epoch, unless the node is restarted
we are talking about a newly syncing node, right? when you say
agreed. resolution during sync creates a lot of busy work.
but do you think we should not fall back to tortoise active set in syncing mode? the only place that's needed is to verify a block certificate. i don't think falling back to tortoise active set is the right choice here.
yes. i mean that consensus results are not updated, as tortoise doesn't receive a call to TallyVotes
i think the problem is that we compute this fallback before tortoise has verified layers. so we ask for the active set for layer 1000, while tortoise is still waiting for TallyVotes on layer 10. so naturally there are no blocks that were applied
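one way to picture the timing problem described above is a guard that refuses to compute the fallback while consensus results lag too far behind the requested layer. this is only a sketch: `fallbackAllowed`, `LayerID` as defined here, and the `maxLag` threshold are hypothetical names for illustration and are not taken from go-spacemesh.

```go
package main

// LayerID is a hypothetical stand-in for the node's layer number type.
type LayerID uint32

// fallbackAllowed reports whether it is reasonable to derive a fallback
// active set for target. lastTallied is the last layer for which tortoise
// received TallyVotes; maxLag is an assumed tolerance, not a real parameter.
func fallbackAllowed(lastTallied, target LayerID, maxLag uint32) bool {
	if target <= lastTallied {
		// consensus results already cover the requested layer
		return true
	}
	// e.g. target=1000 with lastTallied=10 is far beyond any small maxLag
	return uint32(target-lastTallied) <= maxLag
}
```

with a guard like this, a request for layer 1000 while tortoise is at layer 10 would be rejected instead of caching a result that is likely to be wrong later.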
something doesn't sound right to me still. i think i am missing the codepath where this discrepancy would happen. in state_syncer.go, the logic is for every layer [last layer in state (M), last (ballot) synced layer (N)]
why would tortoise be waiting for TallyVotes on layer 10 still? tortoise should have all the ballots up to layer N already?
because the thread that calls ProcessLayer gets into mesh agreement with peers and doesn't call TallyVotes in a timely manner
ok. thanks. let me first change to disable the hash resolution during sync.
with this change sync and consensus results progress without halting, so maybe downloading layers is simply faster than processing them?
if memory serves, that was true even before we were also downloading blocks.
## Motivation
part of #4552
## Changes
do not engage in hash resolution when the node hasn't processed all layers during sync
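a minimal sketch of the condition this change describes, assuming the syncer can see both the last processed layer and the last downloaded layer. `syncProgress` and `shouldResolveHashes` are hypothetical names, not the actual go-spacemesh code.

```go
package main

// LayerID is a hypothetical stand-in for the node's layer number type.
type LayerID uint32

// syncProgress is an assumed snapshot of where the node is during sync.
type syncProgress struct {
	Processed LayerID // last layer whose consensus results were applied
	Synced    LayerID // last layer whose ballots/blocks were downloaded
}

// shouldResolveHashes reports whether the node may compare mesh hashes with
// peers. it returns false while processed layers still lag behind downloaded
// layers, so hash resolution cannot stall layer processing during sync.
func shouldResolveHashes(p syncProgress) bool {
	return p.Processed >= p.Synced
}
```

the state syncer would check this before starting hash resolution for a layer and simply continue processing when it returns false.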
@dshulyak double-checking the logic for there is a for a node to change hare active set after it starts participating in hare.
everyone should have the same active set. a node that restarts will select the correct active set from the block, and one that doesn't will stick to a possibly incorrect tortoise active set. there should always be a codepath that leads to the same result regardless of when the method was called; the current approach is simply not robust. but fixing that also doesn't solve the whole problem: sync can't download and verify certificates while the processed layer is so far behind. it will fail all of them if the tortoise active set is not equal to the active set that was used to sign those certificates
thinking about options:
i think so too. i don't know how fast this part will be implemented though, there is also a dependency on sync grades, or we need to do something else about them. and we still want to test it well before switching
what active set will we use for validating a certificate? if it is not the same active set that was used to create it, then cert validation will likely fail. but besides that, B2 doesn't seem like a complex change? i think we can wait with this; if it becomes clear that hare3 with out-of-consensus active set won't land in the next 4 weeks, then B2 makes sense
@countvonzero so this is an ok option? i already see it in the network, we are just lucky that the first block active set doesn't differ from tortoise (at least on my node)
i noticed that during sync layers are not processed immediately; quite often they get stuck for an unclear reason. during this time sync downloads more layers and forces hare to cache an active set which would be incorrect later.
i think that the current implementation is not robust and should be based on event notifications, so that a particular timing of calls won't have an impact on the end result. in this particular case we should add OnAppliedBlock to hare oracle so that it can update its local cache if a block was applied at a later point in time
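a rough sketch of what such an event-driven cache could look like. everything here is an assumption for illustration: the `activeSetCache` type, the `CacheFallback` method, the signature of `OnAppliedBlock`, and the rule that the first applied block of an epoch wins are not taken from the hare oracle code.

```go
package main

import "sync"

// EpochID and ATXID are hypothetical stand-ins for the node's types.
type EpochID uint32
type ATXID string

type cachedSet struct {
	atxs      []ATXID
	fromBlock bool // true once the set came from an applied block
}

// activeSetCache keeps one active set per epoch. a set taken from an applied
// block always replaces a tortoise fallback, regardless of call order.
type activeSetCache struct {
	mu   sync.Mutex
	sets map[EpochID]cachedSet
}

func newActiveSetCache() *activeSetCache {
	return &activeSetCache{sets: make(map[EpochID]cachedSet)}
}

// CacheFallback stores a tortoise-derived set unless a block-derived set
// is already cached for the epoch.
func (c *activeSetCache) CacheFallback(epoch EpochID, atxs []ATXID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.sets[epoch]; ok && cur.fromBlock {
		return
	}
	c.sets[epoch] = cachedSet{atxs: atxs}
}

// OnAppliedBlock is the event hook: when a block is applied later than the
// fallback was computed, its active set overrides the cached fallback.
// the first applied block of an epoch wins (an assumption of this sketch).
func (c *activeSetCache) OnAppliedBlock(epoch EpochID, atxs []ATXID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.sets[epoch]; ok && cur.fromBlock {
		return
	}
	c.sets[epoch] = cachedSet{atxs: atxs, fromBlock: true}
}

// Get returns the cached active set for the epoch, if any.
func (c *activeSetCache) Get(epoch EpochID) ([]ATXID, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	cur, ok := c.sets[epoch]
	return cur.atxs, ok
}
```

with this shape the end result is the same whether the fallback is cached before or after the block is applied, which is the robustness property asked for above.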