sync should not turn off consensus #5036

lrettig · 2023-09-19T16:19:54Z

Description

I'm running two nodes (both v1.1.6), both see exactly the same layer data, but one thinks it's synced and one doesn't:

> grpcurl -plaintext localhost:9092 spacemesh.v1.NodeService.Status
{
  "status": {
    "connectedPeers": "44",
    "syncedLayer": {
      "number": 19386
    },
    "topLayer": {
      "number": 19395
    },
    "verifiedLayer": {
      "number": 19385
    }
  }
}
> grpcurl -plaintext localhost:9192 spacemesh.v1.NodeService.Status
{
  "status": {
    "connectedPeers": "1",
    "isSynced": true,
    "syncedLayer": {
      "number": 19386
    },
    "topLayer": {
      "number": 19395
    },
    "verifiedLayer": {
      "number": 19385
    }
  }
}

The one that thinks it's out of sync keeps printing:

2023-09-19T12:17:50.296-0400 INFO abcde.sync node is too far behind {"node_id": "abcde", "module": "sync", "sessionId": "06dadc41-5e8b-4f6a-8e67-7baff90bb12c", "current": "19395", "last synced": "19389", "behind threshold": 3, "name": "sync"}
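For readers unfamiliar with the check behind this log line, here is a minimal sketch of a layer-gap heuristic of the kind being discussed. The function name `tooFarBehind` and the strict `>` comparison are assumptions for illustration; the actual go-spacemesh code may compare differently. Only the three numbers come from the log above.

```go
package main

import "fmt"

// tooFarBehind is a hypothetical helper, not the go-spacemesh implementation.
// It reports whether the gap between the current layer and the last synced
// layer exceeds the threshold. Whether the real check uses > or >= is an
// assumption here.
func tooFarBehind(current, lastSynced, threshold uint32) bool {
	if lastSynced >= current {
		// Nothing to catch up on; also avoids unsigned underflow.
		return false
	}
	return current-lastSynced > threshold
}

func main() {
	// Values from the log line: current=19395, last synced=19389, threshold=3.
	fmt.Println(tooFarBehind(19395, 19389, 3)) // true: the gap is 6 layers
	// A gap of 2 layers stays under the threshold.
	fmt.Println(tooFarBehind(19395, 19393, 3)) // false
}
```

With a threshold this small, a sync pass that is delayed by a few layers is enough to flip the node to "not synced".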

The only difference between them is that the "in sync" node is a private, local peer connected only to the other node, and the "out of sync" node is a public node with 40-50 peers. Unclear if this is related.


lrettig · 2023-09-19T16:55:44Z

logfile for "out of sync" node:

log.txt.gz

dshulyak · 2023-09-19T17:53:29Z

i think this is a big problem. sync should be very conservative about when it can interrupt consensus. in fact it should almost never interrupt it. the current heuristic of 3 layers is no good, or something else keeps it from working.
in the latest outage it made some nodes change sync state consistently.

dshulyak · 2023-09-20T04:20:34Z

let's change it so that a node becomes not synced only if it was offline (0 peers reported by libp2p) for 30 minutes. in all other cases sync should not interrupt consensus
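A minimal sketch of the connectivity-only rule proposed here, assuming the peer count is sampled periodically. The type `connectivityGate`, its methods, and the `simulate` helper are hypothetical names for illustration and are not taken from go-spacemesh or libp2p; only the "0 peers for 30 minutes" rule comes from the comment above.

```go
package main

import (
	"fmt"
	"time"
)

// connectivityGate is a hypothetical type that tracks how long the node has
// had zero peers.
type connectivityGate struct {
	offlineSince time.Time     // zero value means the node currently has peers
	maxOffline   time.Duration // how long 0 peers is tolerated
}

// observe records the peer count seen at time now.
func (g *connectivityGate) observe(peers int, now time.Time) {
	if peers > 0 {
		g.offlineSince = time.Time{} // back online, reset the timer
		return
	}
	if g.offlineSince.IsZero() {
		g.offlineSince = now // first observation with 0 peers
	}
}

// shouldParticipate reports whether consensus should keep running at time now.
// It is false only after the node has been offline for maxOffline or longer.
func (g *connectivityGate) shouldParticipate(now time.Time) bool {
	if g.offlineSince.IsZero() {
		return true
	}
	return now.Sub(g.offlineSince) < g.maxOffline
}

// simulate replays a short timeline and returns the decision at each step.
func simulate() []bool {
	start := time.Date(2023, 9, 20, 0, 0, 0, 0, time.UTC)
	g := &connectivityGate{maxOffline: 30 * time.Minute}
	var out []bool

	g.observe(44, start)
	out = append(out, g.shouldParticipate(start)) // has peers

	g.observe(0, start.Add(time.Minute))
	out = append(out, g.shouldParticipate(start.Add(10*time.Minute))) // offline 9m
	out = append(out, g.shouldParticipate(start.Add(31*time.Minute))) // offline 30m

	g.observe(1, start.Add(32*time.Minute))
	out = append(out, g.shouldParticipate(start.Add(32*time.Minute))) // back online
	return out
}

func main() {
	fmt.Println(simulate()) // [true true false true]
}
```

Note that nothing in this gate looks at layer numbers, so a slow or failed sync pass cannot switch consensus off.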

lrettig · 2023-09-20T18:54:49Z

looks like all private nodes are in sync while public ones aren't (https://discord.com/channels/623195163510046732/1141739980306272427/1153806007785496698, also: https://discord.com/channels/623195163510046732/1141739980306272427/1153911699691282482)

I noticed this behavior as well (before the patch, on v1.1.6). do we have any idea why this might've been the case?

…5040) related: #5036 in future we should drop it completely, and use only connectivity information to decide if node should stop participating in consensus. there should be no risk of interrupting consensus, because of any unexpected failures in sync process.

dshulyak · 2023-09-22T17:25:31Z

i think that's because atx sync queries are slow, and that caused sync to be delayed and fall out of the expected window. below you can see that problems stopped when the epoch started, and atx sync stopped.

private nodes query only the public node and all their queries are fast, but a public node queries random peers on the network and responses are unpredictable. i think this just highlighted that we should not have such risky heuristics.

lrettig · 2023-09-22T17:56:13Z

For the record this is what @tal-m said - in order for a node to participate in consensus:

It definitely needs to have timing information for the Hare rounds in the current instance, and IIRC for voting on proposals it needs information about the current state and the recent transactions in the mempool (to compute "conservative balance").
For generating ballots it needs information about the recent Hare instances, and if there are layers older than hdist that haven't passed the confidence threshold it also needs all the most recent ballots.
Of course, for doing anything it needs to be able to evaluate eligibility, so it needs to know the active set and any malfeasance proofs.

dshulyak · 2023-09-22T18:26:53Z

i was advocating for something similar in this topic #4504 (comment)

but i think hare can have more relaxed conditions: unlike tortoise, hare can't vote against, so it is the same as not voting at all. i think we will remove the IsSynced condition altogether, and rely on specific data being available.

but what is more important is that:

- sync can't stop consensus (maybe what i changed is enough, but i would remove this completely and turn off consensus only if i have 0 peers for longer than 20-30 minutes)
- additionally there is a persistent issue about selecting responsive peers, which is also quite important

dshulyak · 2023-10-06T14:06:31Z

likely same problem as #4977

closes: #5127 #5036

peers that are overwhelmed or generally unresponsive will not be used for requests. there are two criteria used to select a good peer:
- request success rate. success rates within 0.1 (10%) of each other are treated as equal, and in such a case we will use latency
- latency. the hs/1 protocol is used to track latency, as it is the most used protocol and objects served in this protocol are of the same size, with several exceptions (active sets, list of malfeasance proofs).

related: #4977
limits the number of peers to request data for atxs. previously we were requesting data from all peers at least once.

synced data 2 times in 90m, previous attempt on my computer was 1 week ago and took 12h
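The two selection criteria can be sketched as follows. This is an illustration only, not the code merged in #5143: `peerStats`, `better`, `selectBest` and `rateTolerance` are hypothetical names, and how the real implementation counts requests and measures latency is not shown in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// peerStats is a hypothetical record of what we know about a peer.
type peerStats struct {
	id        string
	successes int
	failures  int
	latency   time.Duration // e.g. observed latency on the hs/1 protocol
}

// successRate returns the fraction of successful requests, or 0 if there
// were no requests yet.
func (p peerStats) successRate() float64 {
	total := p.successes + p.failures
	if total == 0 {
		return 0
	}
	return float64(p.successes) / float64(total)
}

// rateTolerance: success rates closer than this are treated as equal.
const rateTolerance = 0.1

// better reports whether a should be preferred over b: higher success rate
// wins, unless the rates are within rateTolerance, in which case lower
// latency wins.
func better(a, b peerStats) bool {
	diff := a.successRate() - b.successRate()
	if diff > rateTolerance || diff < -rateTolerance {
		return diff > 0
	}
	return a.latency < b.latency
}

// selectBest scans the peers once and returns the preferred one. A linear
// scan is used instead of sort because "equal within a tolerance" is not
// transitive, so it does not define a valid sort order.
func selectBest(peers []peerStats) (peerStats, bool) {
	if len(peers) == 0 {
		return peerStats{}, false
	}
	best := peers[0]
	for _, p := range peers[1:] {
		if better(p, best) {
			best = p
		}
	}
	return best, true
}

func main() {
	peers := []peerStats{
		{id: "a", successes: 95, failures: 5, latency: 300 * time.Millisecond},
		{id: "b", successes: 90, failures: 10, latency: 50 * time.Millisecond},
		{id: "c", successes: 50, failures: 50, latency: 10 * time.Millisecond},
	}
	best, _ := selectBest(peers)
	// "b": its rate is within 0.1 of a's and it is faster; c is fast but fails too often.
	fmt.Println(best.id)
}
```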


lrettig added the bug label Sep 19, 2023

dshulyak added the area/sync label Sep 19, 2023

dshulyak changed the title ~~isSynced doesn't seem to depend on syncedLayer~~ sync should not turn off consensus Sep 20, 2023

dshulyak mentioned this issue Sep 20, 2023

[Merged by Bors] - sync: parametrize out of sync threshold and set it to 3h for mainnet #5040

Closed

dshulyak self-assigned this Oct 6, 2023

dshulyak mentioned this issue Oct 12, 2023

[Merged by Bors] - sync: prioritize peers with higher success rate and low latency #5143

Closed

dshulyak closed this as completed Oct 13, 2023
