hare loads incorrect (tortoise) active set during sync #4552

Closed
dshulyak opened this issue Jun 20, 2023 · 16 comments


dshulyak commented Jun 20, 2023

i noticed that during sync layers are not processed immediately; quite often they get stuck for no clear reason. meanwhile sync downloads more layers and forces hare to cache an active set that will be incorrect later.

i think the current implementation is not robust and should be based on event notifications, so that the particular timing of calls has no impact on the end result. in this particular case we should add OnAppliedBlock to the hare oracle so that it can update its local cache if a block was applied at a later point in time
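a minimal sketch of what that hook could look like, assuming a per-epoch cache inside the oracle; the package name, the Oracle shape and the OnAppliedBlock signature are illustrative, not the actual go-spacemesh API:

```go
package eligibility

import (
	"sync"

	"github.com/spacemeshos/go-spacemesh/common/types"
)

// Oracle is a stripped-down stand-in for the hare oracle: it caches one
// active set per epoch. Field and method names are invented for the sketch.
type Oracle struct {
	mu        sync.Mutex
	activeSet map[types.EpochID][]types.ATXID
}

// OnAppliedBlock is the proposed event hook: when a block is applied
// (possibly long after sync cached a tortoise-based fallback), the oracle
// replaces its cached active set so later lookups use the applied one.
func (o *Oracle) OnAppliedBlock(epoch types.EpochID, activeSet []types.ATXID) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.activeSet == nil {
		o.activeSet = make(map[types.EpochID][]types.ATXID)
	}
	o.activeSet[epoch] = activeSet
}
```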

@dshulyak

for example

93516:2023-06-20T09:37:06.544+0200 WARN 9eba1.hareOracle using tortoise active set {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "hareOracle", "epoch": 9, "beacon": "0x885733e6", "name": "hareOracle"}

but the applied layer is very far from that:

95738:2023-06-20T09:41:21.415+0200 INFO 9eba1.executor executed block {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "executor", "sessionId": "8831cbf8-8bd2-4fca-9d2b-8d39cf39ec23", "layer_id": 3877, "block_id": "d07049d448", "event": true, "block": "d07049d448", "state_hash": "0x6b14d8457e5a39878962adc0d77801a5268508207387e0e752110a1b2a758f09", "duration": "1.600513ms", "count": 0, "name": "executor"}
95741:2023-06-20T09:41:21.774+0200 INFO 9eba1.mesh consensus results {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "mesh", "sessionId": "8831cbf8-8bd2-4fca-9d2b-8d39cf39ec23", "layer_id": 3878, "results": [{"layer": 3877, "opinion": "b266726448", "verified": true, "blocks": [{"id": "d07049d448", "layer": 3877, "height": 37847, "valid": true, "invalid": false, "hare": true, "data": true, "local": true}]}, {"layer": 3878, "opinion": "8921b07215", "verified": false, "blocks": []}], "name": "mesh"}

@dshulyak

also, every time consensus is stuck i see a log like:

95742:2023-06-20T09:41:22.421+0200 INFO 9eba1.sync begin fork finding with peer {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "8831cbf8-8bd2-4fca-9d2b-8d39cf39ec23", "diff_layer": "3878", "diff_hash": "0x42a5e49d6e3c7ab9a2e06bca0304f4e7c4f832c0f4ccc582b409375f6e54b420", "peer": "12D3KooWPhJ5BwDiZCVtH4Yp7pCz25ym7JypZrPCxZCqMc5jAMmx", "name": "sync"}
95743:2023-06-20T09:41:23.552+0200 INFO 9eba1.sync found hash fork with peer {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "8831cbf8-8bd2-4fca-9d2b-8d39cf39ec23", "diff_layer": "3878", "diff_hash": "0x42a5e49d6e3c7ab9a2e06bca0304f4e7c4f832c0f4ccc582b409375f6e54b420", "peer": "12D3KooWPhJ5BwDiZCVtH4Yp7pCz25ym7JypZrPCxZCqMc5jAMmx", "boundary": {"from": 3867, "from_hash": "0x8c34ab746836fb75ed3361b0f10018ed521437ed77cd764e80a6da8f18b27953", "to": 3878, "to_hash": "0x42a5e49d6e3c7ab9a2e06bca0304f4e7c4f832c0f4ccc582b409375f6e54b420"}, "num_reqs": 2, "fork": "3877", "fork_hash": "0xb266726448b6ef06b62ccf29ef91d8ec6cbeff44cec8dd541d7301d79f11b286", "after_fork": "3878", "after_fork_hash": "0x42a5e49d6e3c7ab9a2e06bca0304f4e7c4f832c0f4ccc582b409375f6e54b420", "name": "sync"}


dshulyak commented Jun 20, 2023

@countvonzero i think that's because the fork finder blocks layer processing, and we continue to download certificates/ballots/etc in the meantime

@dshulyak

we should probably disable the fork finder while the node is doing initial sync. the current behavior breaks the hare oracle for the rest of the epoch unless the node is restarted
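for illustration, a guard along these lines (all names and the Syncer shape below are made up for the sketch; the real syncer and fork finder code look different):

```go
package syncer

import "context"

// placeholder types standing in for the real go-spacemesh ones
type (
	Peer    string
	LayerID uint32
)

// forkResolver stands in for the fork finder used during hash resolution.
type forkResolver interface {
	FindFork(ctx context.Context, peer Peer, lid LayerID) (LayerID, error)
}

// Syncer carries only what the sketch needs.
type Syncer struct {
	processed  LayerID // highest layer processed by mesh/tortoise
	lastSynced LayerID // highest layer downloaded so far
	forkFinder forkResolver
}

// maybeFindFork skips hash resolution while the node is still catching up:
// during initial sync the processed layer lags far behind the synced layer,
// so hash mismatches with peers are expected and resolving them is busy work.
func (s *Syncer) maybeFindFork(ctx context.Context, peer Peer, lid LayerID) (LayerID, error) {
	if s.processed < s.lastSynced {
		return 0, nil // still syncing: do nothing
	}
	return s.forkFinder.FindFork(ctx, peer, lid)
}
```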

@countvonzero

> also, every time consensus is stuck i see a log like

we are talking about a newly syncing node, right? when you say consensus is stuck do you mean tortoise verified layer is stuck?

> we should probably disable the fork finder while the node is doing initial sync.

agreed. resolution during sync creates a lot of busy work.

> we should add OnAppliedBlock to the hare oracle so that it can update its local cache if a block was applied at a later point in time

but do you think we should not fall back to the tortoise active set in syncing mode? the only place where that's needed is to verify a block certificate. i don't think falling back to the tortoise active set is the right choice here.


dshulyak commented Jun 20, 2023

> we are talking about a newly syncing node, right? when you say consensus is stuck do you mean tortoise verified layer is stuck?

yes. i mean that consensus results are not updated, because tortoise doesn't receive a call to TallyVotes

> but do you think we should not fall back to the tortoise active set in syncing mode? the only place where that's needed is to verify a block certificate. i don't think falling back to the tortoise active set is the right choice here.

i think the problem is that we compute this fallback before tortoise has verified the layers. we ask for the active set for layer 1000 while tortoise is still waiting for TallyVotes on layer 10, so naturally no blocks have been applied yet
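a rough sketch of the ordering problem, assuming the oracle prefers the active set from the first applied block of the epoch and only falls back to tortoise when nothing has been applied yet; the provider interface and function names are invented for illustration:

```go
package eligibility

import (
	"errors"

	"github.com/spacemeshos/go-spacemesh/common/types"
)

var errEmptySet = errors.New("no active set available")

// activeSetProvider abstracts the two sources mentioned in the discussion:
// the first applied block of the epoch, and the tortoise fallback.
type activeSetProvider interface {
	FromAppliedBlock(types.EpochID) ([]types.ATXID, error)
	FromTortoise(types.EpochID) ([]types.ATXID, error)
}

// pickActiveSet illustrates the problematic ordering: during sync the applied
// layer lags far behind the synced layer, FromAppliedBlock finds nothing, and
// the tortoise fallback (the "using tortoise active set" warning in the logs)
// gets computed and cached too early.
func pickActiveSet(p activeSetProvider, epoch types.EpochID) ([]types.ATXID, error) {
	if set, err := p.FromAppliedBlock(epoch); err == nil && len(set) > 0 {
		return set, nil
	}
	set, err := p.FromTortoise(epoch)
	if err != nil {
		return nil, err
	}
	if len(set) == 0 {
		return nil, errEmptySet
	}
	return set, nil
}
```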

@countvonzero

> i think the problem is that we compute this fallback before tortoise has verified the layers. we ask for the active set for layer 1000 while tortoise is still waiting for TallyVotes on layer 10, so naturally no blocks have been applied yet

something doesn't sound right to me still. i think i am missing the codepath where this discrepancy would happen.

in state_syncer.go, the logic is (roughly as sketched below):

for every layer in [last layer in state (M), last (ballot) synced layer (N)]:

  • sync opinions (block certificates)
  • sync the cert block and validate the certificate - this is when the hare active set is calculated / cached
  • call mesh.ProcessLayer, which queries tortoise updates and causes the block to be executed

why would tortoise be waiting for TallyVotes on layer 10 still? tortoise should have all the ballots up to layer N already?
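roughly, the loop described above, paraphrased with invented names and simplified callback signatures (not the real state_syncer.go code):

```go
package syncer

import "context"

type LayerID uint32

// processLayers paraphrases the per-layer sync loop. The three callbacks
// stand in for: syncing opinions (block certificates), fetching and
// validating the certified block (the step that calculates/caches the hare
// active set), and mesh.ProcessLayer (the step that queries tortoise updates
// and executes the applied block).
func processLayers(
	ctx context.Context,
	from, to LayerID, // last layer in state (M) + 1 .. last synced layer (N)
	syncOpinions, syncCertBlock, processLayer func(context.Context, LayerID) error,
) error {
	for lid := from; lid <= to; lid++ {
		if err := syncOpinions(ctx, lid); err != nil {
			return err
		}
		if err := syncCertBlock(ctx, lid); err != nil {
			return err
		}
		if err := processLayer(ctx, lid); err != nil {
			return err
		}
	}
	return nil
}
```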


dshulyak commented Jun 20, 2023

because the thread that calls ProcessLayer gets into mesh agreement with peers and doesn't call TallyVotes in a timely manner:

2023-06-20T09:58:00.104+0200 INFO 9eba1.sync begin fork finding with peer {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "0ad035b1-c03b-4cc1-8a36-003b4a45031e", "diff_layer": "3876", "diff_hash": "0xa36b2bc89cde9083fab41bdc5231671fbb6cae0ea74d0f3b62e5fcf42de13e76", "peer": "12D3KooWA96fvi2pQaQHpS2cpahDJCyQSsBW6A4KwHCPXsDVuiEJ", "name": "sync"}
2023-06-20T09:58:00.210+0200 INFO 9eba1.sync found hash fork with peer {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "0ad035b1-c03b-4cc1-8a36-003b4a45031e", "diff_layer": "3876", "diff_hash": "0xa36b2bc89cde9083fab41bdc5231671fbb6cae0ea74d0f3b62e5fcf42de13e76", "peer": "12D3KooWA96fvi2pQaQHpS2cpahDJCyQSsBW6A4KwHCPXsDVuiEJ", "boundary": {"from": 3671, "from_hash": "0x86b9735e87a2bfa3bd2e7c111271389867e594e562e7fa70e491aeb4efa3e5f1", "to": 3699, "to_hash": "0xc6fb96efe15b61e9735fc3740412ed38859dd0e1fc2c3c82165379091719d4e0"}, "num_reqs": 2, "fork": "3671", "fork_hash": "0x86b9735e87a2bfa3bd2e7c111271389867e594e562e7fa70e491aeb4efa3e5f1", "after_fork": "3672", "after_fork_hash": "0x5a4a50b71ff4a8e1fb319df5a2228b55fe8ae4861689acc025fcaaa906524ada", "name": "sync"}

@countvonzero

ok, thanks. let me first make the change to disable hash resolution during sync.

@dshulyak

with this change sync and consensus results progress without halting, so maybe downloading layers is simply faster than processing them?

101029:2023-06-21T06:55:27.961+0200 INFO 9eba1.sync sync state change {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "f7166574-942d-44a6-b20a-f1e46fa95900", "from state": "notSynced", "to state": "gossipSync", "current": "5867", "last synced": "5866", "latest": "3420", "processed": "3420", "name": "sync"}

101879:2023-06-21T06:57:06.703+0200 WARN 9eba1.hareOracle using tortoise active set {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "hareOracle", "epoch": 10, "beacon": "0x2fbac248", "name": "hareOracle"}

@countvonzero

> with this change sync and consensus results progress without halting, so maybe downloading layers is simply faster than processing them?

if memory serves, that was true even before, when we were also downloading blocks.

bors bot pushed a commit that referenced this issue Jun 21, 2023
## Motivation
part of #4552

## Changes
do not engage in hash resolution when the node hasn't processed all layers during sync
@countvonzero

@dshulyak double-checking the logic for OnAppliedBlock.

there is a possibility for a node to change its hare active set after it starts participating in hare.
if there was a revert, then the network will start to disagree on the hare active set.
i think it's better to stick to the first (and possibly wrong) hare active set instead of changing to the correct one.


dshulyak commented Jun 22, 2023

> i think it's better to stick to the first (and possibly wrong) hare active set instead of changing to the correct one.

everyone should have the same active set. a node that restarts will select the correct active set from the block, while one that doesn't will stick to the possibly incorrect tortoise active set. there should always be a codepath that leads to the same result regardless of when a method was called; the current approach is simply not robust.

but fixing that also doesn't solve the whole problem:

sync can't download and verify certificates while the processed layer is so far behind. it will fail all of them if the tortoise active set is not equal to the active set that was used to sign those certificates

@countvonzero

thinking about options:

  • A. for hare 3 we don't need agreement on hare active set(?). so maybe this issue doesn't matter anymore
  • B. the only codepath for sync to populate the hare active set is to validate a block certificate.
    • B.1 we don't sync block certificates at all and let tortoise figure out the consensus during catch-up sync
    • B.2 we explicitly clear the hare active set right before catch-up sync is complete (see the sketch below)
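a minimal sketch of B.2, assuming the syncer can reach the oracle's cache when it flips to the synced state; the interface, state names and method names are hypothetical:

```go
package syncer

// activeSetCache is a stand-in for the hare oracle's per-epoch cache.
type activeSetCache interface {
	Reset()
}

type syncState int

const (
	notSynced syncState = iota
	gossipSync
	synced
)

// Syncer is stripped down to what the sketch needs.
type Syncer struct {
	state  syncState
	oracle activeSetCache
}

// setSynced is where option B.2 would hook in: right before catch-up sync is
// declared complete, clear whatever active sets were cached while validating
// block certificates, so they get rebuilt from applied blocks afterwards.
func (s *Syncer) setSynced() {
	if s.state != synced {
		s.oracle.Reset()
		s.state = synced
	}
}
```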


dshulyak commented Aug 9, 2023

> for hare 3 we don't need agreement on hare active set(?). so maybe this issue doesn't matter anymore

i think so too. i don't know how fast this part will be implemented though; there is also a dependency on syncing grades, or doing something else about them. and we still want to test it well before switching

> the only codepath for sync to populate the hare active set is to validate a block certificate.

what active set will we use for validating the certificate? if it's not the same active set that was used to create it, then cert validation will likely fail. but besides that, B.2 doesn't seem like a complex change?

i think we can wait on this; if it becomes clear that hare3 with an out-of-consensus active set won't land in the next 4 weeks, then B.2 makes sense

@dshulyak

> B.2 we explicitly clear the hare active set right before catch-up sync is complete

@countvonzero so is this an ok option? i already see it happening in the network; we are just lucky that the first block's active set doesn't differ from the tortoise one (at least on my node)
