hare loads incorrect (tortoise) active set during sync #4552

Closed
dshulyak opened this issue Jun 20, 2023 · 16 comments


dshulyak commented Jun 20, 2023

i noticed that during sync layers are not processed immediately; quite often they get stuck for no clear reason. meanwhile sync downloads more layers and forces hare to cache an active set that will be incorrect later.

i think the current implementation is not robust and should be based on event notifications, so that the particular timing of calls has no impact on the end result. in this particular case we should add OnAppliedBlock to the hare oracle so that it can update its local cache if a block was applied at a later point in time
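a minimal sketch of what that hook could look like, assuming a per-epoch cache inside the oracle; the package name, the Oracle shape and the OnAppliedBlock signature are illustrative, not the actual go-spacemesh API:

```go
package eligibility

import (
	"sync"

	"github.com/spacemeshos/go-spacemesh/common/types"
)

// Oracle is a stripped-down stand-in for the hare oracle: it caches one
// active set per epoch. Field and method names are invented for the sketch.
type Oracle struct {
	mu        sync.Mutex
	activeSet map[types.EpochID][]types.ATXID
}

// OnAppliedBlock is the proposed event hook: when a block is applied
// (possibly long after sync cached a tortoise-based fallback), the oracle
// replaces its cached active set so later lookups use the applied one.
func (o *Oracle) OnAppliedBlock(epoch types.EpochID, activeSet []types.ATXID) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.activeSet == nil {
		o.activeSet = make(map[types.EpochID][]types.ATXID)
	}
	o.activeSet[epoch] = activeSet
}
```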

@dshulyak

for example

93516:2023-06-20T09:37:06.544+0200 WARN 9eba1.hareOracle using tortoise active set {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "hareOracle", "epoch": 9, "beacon": "0x885733e6", "name": "hareOracle"}

but the applied layer is very far from that:

95738:2023-06-20T09:41:21.415+0200 INFO 9eba1.executor executed block {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "executor", "sessionId": "8831cbf8-8bd2-4fca-9d2b-8d39cf39ec23", "layer_id": 3877, "block_id": "d07049d448", "event": true, "block": "d07049d448", "state_hash": "0x6b14d8457e5a39878962adc0d77801a5268508207387e0e752110a1b2a758f09", "duration": "1.600513ms", "count": 0, "name": "executor"}
95741:2023-06-20T09:41:21.774+0200 INFO 9eba1.mesh consensus results {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "mesh", "sessionId": "8831cbf8-8bd2-4fca-9d2b-8d39cf39ec23", "layer_id": 3878, "results": [{"layer": 3877, "opinion": "b266726448", "verified": true, "blocks": [{"id": "d07049d448", "layer": 3877, "height": 37847, "valid": true, "invalid": false, "hare": true, "data": true, "local": true}]}, {"layer": 3878, "opinion": "8921b07215", "verified": false, "blocks": []}], "name": "mesh"}

@dshulyak

also, every time consensus is stuck i see a log like:

95742:2023-06-20T09:41:22.421+0200 INFO 9eba1.sync begin fork finding with peer {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "8831cbf8-8bd2-4fca-9d2b-8d39cf39ec23", "diff_layer": "3878", "diff_hash": "0x42a5e49d6e3c7ab9a2e06bca0304f4e7c4f832c0f4ccc582b409375f6e54b420", "peer": "12D3KooWPhJ5BwDiZCVtH4Yp7pCz25ym7JypZrPCxZCqMc5jAMmx", "name": "sync"}
95743:2023-06-20T09:41:23.552+0200 INFO 9eba1.sync found hash fork with peer {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "8831cbf8-8bd2-4fca-9d2b-8d39cf39ec23", "diff_layer": "3878", "diff_hash": "0x42a5e49d6e3c7ab9a2e06bca0304f4e7c4f832c0f4ccc582b409375f6e54b420", "peer": "12D3KooWPhJ5BwDiZCVtH4Yp7pCz25ym7JypZrPCxZCqMc5jAMmx", "boundary": {"from": 3867, "from_hash": "0x8c34ab746836fb75ed3361b0f10018ed521437ed77cd764e80a6da8f18b27953", "to": 3878, "to_hash": "0x42a5e49d6e3c7ab9a2e06bca0304f4e7c4f832c0f4ccc582b409375f6e54b420"}, "num_reqs": 2, "fork": "3877", "fork_hash": "0xb266726448b6ef06b62ccf29ef91d8ec6cbeff44cec8dd541d7301d79f11b286", "after_fork": "3878", "after_fork_hash": "0x42a5e49d6e3c7ab9a2e06bca0304f4e7c4f832c0f4ccc582b409375f6e54b420", "name": "sync"}


dshulyak commented Jun 20, 2023

@countvonzero i think that's because the fork finder blocks layer processing, and we continue to download certificates/ballots/etc in the meantime

@dshulyak

we should probably disable the fork finder while the node is doing initial sync. the current behavior breaks the hare oracle for the rest of the epoch unless the node is restarted
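for illustration, a guard along these lines (all names and the Syncer shape below are made up for the sketch; the real syncer and fork finder code look different):

```go
package syncer

import "context"

// placeholder types standing in for the real go-spacemesh ones
type (
	Peer    string
	LayerID uint32
)

// forkResolver stands in for the fork finder used during hash resolution.
type forkResolver interface {
	FindFork(ctx context.Context, peer Peer, lid LayerID) (LayerID, error)
}

// Syncer carries only what the sketch needs.
type Syncer struct {
	processed  LayerID // highest layer processed by mesh/tortoise
	lastSynced LayerID // highest layer downloaded so far
	forkFinder forkResolver
}

// maybeFindFork skips hash resolution while the node is still catching up:
// during initial sync the processed layer lags far behind the synced layer,
// so hash mismatches with peers are expected and resolving them is busy work.
func (s *Syncer) maybeFindFork(ctx context.Context, peer Peer, lid LayerID) (LayerID, error) {
	if s.processed < s.lastSynced {
		return 0, nil // still syncing: do nothing
	}
	return s.forkFinder.FindFork(ctx, peer, lid)
}
```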

@countvonzero

> also, every time consensus is stuck i see a log like

we are talking about a newly syncing node, right? when you say consensus is stuck do you mean tortoise verified layer is stuck?

> we should probably disable the fork finder while the node is doing initial sync.

agreed. resolution during sync creates a lot of busy work.

> we should add OnAppliedBlock to the hare oracle so that it can update its local cache if a block was applied at a later point in time

but do you think we should not fall back to the tortoise active set in syncing mode? the only place where that's needed is to verify a block certificate. i don't think falling back to the tortoise active set is the right choice here.


dshulyak commented Jun 20, 2023

> we are talking about a newly syncing node, right? when you say consensus is stuck do you mean tortoise verified layer is stuck?

yes. i mean that consensus results are not updated, because tortoise doesn't receive a call to TallyVotes

> but do you think we should not fall back to the tortoise active set in syncing mode? the only place where that's needed is to verify a block certificate. i don't think falling back to the tortoise active set is the right choice here.

i think the problem is that we compute this fallback before tortoise has verified the layers. we ask for the active set for layer 1000 while tortoise is still waiting for TallyVotes on layer 10, so naturally no blocks have been applied yet
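a rough sketch of the ordering problem, assuming the oracle prefers the active set from the first applied block of the epoch and only falls back to tortoise when nothing has been applied yet; the provider interface and function names are invented for illustration:

```go
package eligibility

import (
	"errors"

	"github.com/spacemeshos/go-spacemesh/common/types"
)

var errEmptySet = errors.New("no active set available")

// activeSetProvider abstracts the two sources mentioned in the discussion:
// the first applied block of the epoch, and the tortoise fallback.
type activeSetProvider interface {
	FromAppliedBlock(types.EpochID) ([]types.ATXID, error)
	FromTortoise(types.EpochID) ([]types.ATXID, error)
}

// pickActiveSet illustrates the problematic ordering: during sync the applied
// layer lags far behind the synced layer, FromAppliedBlock finds nothing, and
// the tortoise fallback (the "using tortoise active set" warning in the logs)
// gets computed and cached too early.
func pickActiveSet(p activeSetProvider, epoch types.EpochID) ([]types.ATXID, error) {
	if set, err := p.FromAppliedBlock(epoch); err == nil && len(set) > 0 {
		return set, nil
	}
	set, err := p.FromTortoise(epoch)
	if err != nil {
		return nil, err
	}
	if len(set) == 0 {
		return nil, errEmptySet
	}
	return set, nil
}
```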

@countvonzero

> i think the problem is that we compute this fallback before tortoise has verified the layers. we ask for the active set for layer 1000 while tortoise is still waiting for TallyVotes on layer 10, so naturally no blocks have been applied yet

something doesn't sound right to me still. i think i am missing the codepath where this discrepancy would happen.

in state_syncer.go, the logic is (roughly as sketched below):

for every layer in [last layer in state (M), last (ballot) synced layer (N)]:

  • sync opinions (block certificates)
  • sync the cert block and validate the certificate - this is when the hare active set is calculated / cached
  • call mesh.ProcessLayer, which queries tortoise updates and causes the block to be executed

why would tortoise be waiting for TallyVotes on layer 10 still? tortoise should have all the ballots up to layer N already?
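roughly, the loop described above, paraphrased with invented names and simplified callback signatures (not the real state_syncer.go code):

```go
package syncer

import "context"

type LayerID uint32

// processLayers paraphrases the per-layer sync loop. The three callbacks
// stand in for: syncing opinions (block certificates), fetching and
// validating the certified block (the step that calculates/caches the hare
// active set), and mesh.ProcessLayer (the step that queries tortoise updates
// and executes the applied block).
func processLayers(
	ctx context.Context,
	from, to LayerID, // last layer in state (M) + 1 .. last synced layer (N)
	syncOpinions, syncCertBlock, processLayer func(context.Context, LayerID) error,
) error {
	for lid := from; lid <= to; lid++ {
		if err := syncOpinions(ctx, lid); err != nil {
			return err
		}
		if err := syncCertBlock(ctx, lid); err != nil {
			return err
		}
		if err := processLayer(ctx, lid); err != nil {
			return err
		}
	}
	return nil
}
```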


dshulyak commented Jun 20, 2023

because the thread that calls ProcessLayer gets into mesh agreement with peers and doesn't call TallyVotes in a timely manner:

2023-06-20T09:58:00.104+0200 INFO 9eba1.sync begin fork finding with peer {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "0ad035b1-c03b-4cc1-8a36-003b4a45031e", "diff_layer": "3876", "diff_hash": "0xa36b2bc89cde9083fab41bdc5231671fbb6cae0ea74d0f3b62e5fcf42de13e76", "peer": "12D3KooWA96fvi2pQaQHpS2cpahDJCyQSsBW6A4KwHCPXsDVuiEJ", "name": "sync"}
2023-06-20T09:58:00.210+0200 INFO 9eba1.sync found hash fork with peer {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "0ad035b1-c03b-4cc1-8a36-003b4a45031e", "diff_layer": "3876", "diff_hash": "0xa36b2bc89cde9083fab41bdc5231671fbb6cae0ea74d0f3b62e5fcf42de13e76", "peer": "12D3KooWA96fvi2pQaQHpS2cpahDJCyQSsBW6A4KwHCPXsDVuiEJ", "boundary": {"from": 3671, "from_hash": "0x86b9735e87a2bfa3bd2e7c111271389867e594e562e7fa70e491aeb4efa3e5f1", "to": 3699, "to_hash": "0xc6fb96efe15b61e9735fc3740412ed38859dd0e1fc2c3c82165379091719d4e0"}, "num_reqs": 2, "fork": "3671", "fork_hash": "0x86b9735e87a2bfa3bd2e7c111271389867e594e562e7fa70e491aeb4efa3e5f1", "after_fork": "3672", "after_fork_hash": "0x5a4a50b71ff4a8e1fb319df5a2228b55fe8ae4861689acc025fcaaa906524ada", "name": "sync"}

@countvonzero

ok, thanks. let me first make the change to disable hash resolution during sync.

@dshulyak

with this change sync and consensus results progress without halting, so maybe downloading layers is simply faster than processing them?

101029:2023-06-21T06:55:27.961+0200 INFO 9eba1.sync sync state change {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "sync", "sessionId": "f7166574-942d-44a6-b20a-f1e46fa95900", "from state": "notSynced", "to state": "gossipSync", "current": "5867", "last synced": "5866", "latest": "3420", "processed": "3420", "name": "sync"}

101879:2023-06-21T06:57:06.703+0200 WARN 9eba1.hareOracle using tortoise active set {"node_id": "9eba1a2b30d76d51f0ca71fbc2dece8352aa62ed7e97a3f0a13c66bf588aa757", "module": "hareOracle", "epoch": 10, "beacon": "0x2fbac248", "name": "hareOracle"}

@countvonzero

> with this change sync and consensus results progress without halting, so maybe downloading layers is simply faster than processing them?

if memory serves, that was true even before, when we were also downloading blocks.

bors bot pushed a commit that referenced this issue Jun 21, 2023
## Motivation
part of #4552

## Changes
do not engage in hash resolution when the node hasn't processed all layers during sync
@countvonzero

@dshulyak double-checking the logic for OnAppliedBlock.

there is a possibility for a node to change its hare active set after it starts participating in hare.
if there was a revert, then the network will start to disagree on the hare active set.
i think it's better to stick to the first (and possibly wrong) hare active set instead of changing to the correct one.


dshulyak commented Jun 22, 2023

> i think it's better to stick to the first (and possibly wrong) hare active set instead of changing to the correct one.

everyone should have the same active set. a node that restarts will select the correct active set from the block, while one that doesn't will stick to the possibly incorrect tortoise active set. there should always be a codepath that leads to the same result regardless of when a method was called; the current approach is simply not robust.

but fixing that also doesn't solve the whole problem:

sync can't download and verify certificates while the processed layer is so far behind. it will fail all of them if the tortoise active set is not equal to the active set that was used to sign those certificates

@countvonzero

thinking about options:

  • A. for hare 3 we don't need agreement on hare active set(?). so maybe this issue doesn't matter anymore
  • B. the only codepath for sync to populate the hare active set is to validate a block certificate.
    • B.1 we don't sync block certificates at all and let tortoise figure out the consensus during catch-up sync
    • B.2 we explicitly clear the hare active set right before catch-up sync is complete (see the sketch below)
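a minimal sketch of B.2, assuming the syncer can reach the oracle's cache when it flips to the synced state; the interface, state names and method names are hypothetical:

```go
package syncer

// activeSetCache is a stand-in for the hare oracle's per-epoch cache.
type activeSetCache interface {
	Reset()
}

type syncState int

const (
	notSynced syncState = iota
	gossipSync
	synced
)

// Syncer is stripped down to what the sketch needs.
type Syncer struct {
	state  syncState
	oracle activeSetCache
}

// setSynced is where option B.2 would hook in: right before catch-up sync is
// declared complete, clear whatever active sets were cached while validating
// block certificates, so they get rebuilt from applied blocks afterwards.
func (s *Syncer) setSynced() {
	if s.state != synced {
		s.oracle.Reset()
		s.state = synced
	}
}
```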


dshulyak commented Aug 9, 2023

> for hare 3 we don't need agreement on hare active set(?). so maybe this issue doesn't matter anymore

i think so too. i don't know how fast this part will be implemented though; there is also a dependency on syncing grades, or doing something else about them. and we still want to test it well before switching

> the only codepath for sync to populate the hare active set is to validate a block certificate.

what active set will we use for validating the certificate? if it's not the same active set that was used to create it, then cert validation will likely fail. but besides that, B.2 doesn't seem like a complex change?

i think we can wait on this; if it becomes clear that hare3 with an out-of-consensus active set won't land in the next 4 weeks, then B.2 makes sense

@dshulyak

> B.2 we explicitly clear the hare active set right before catch-up sync is complete

@countvonzero so is this an ok option? i already see it happening in the network; we are just lucky that the first block's active set doesn't differ from the tortoise one (at least on my node)
