Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validator client sync check is incompatible with recent genesis local testnets #714

Closed
michaelsproul opened this issue Dec 12, 2019 · 2 comments · Fixed by #834
Closed

Validator client sync check is incompatible with recent genesis local testnets #714

michaelsproul opened this issue Dec 12, 2019 · 2 comments · Fixed by #834

Comments

@michaelsproul
Copy link
Member

@michaelsproul michaelsproul commented Dec 12, 2019

In #656 a check was introduced in the validator client to ensure that the beacon node is actually synced before producing blocks and attestations. This is definitely a good idea, but the current implementation is a bit too restrictive in the case of a "recent genesis" testnet.

if beacon_head_epoch + 1 < current_epoch {
error!(
log,
"Beacon node is not synced";
"node_head_epoch" => format!("{}", beacon_head_epoch),
"current_epoch" => format!("{}", current_epoch),
);

If we've just booted a node, and the recent genesis is more than 2 epochs ago, then the validator client will be stuck thinking the beacon node is still attempting to sync, and the beacon node will be stuck trying to sync non-existent blocks.

This could also happen in theory on a real network -- if nobody proposed any blocks for those first two epochs -- but it's unlikely.

I'm not sure of the best way to address this. Currently I've added in an extra check && beacon_head_epoch > 0, but it's less than ideal (we don't want validator clients to always produce blocks and attestations for the genesis epoch, because they may be still syncing).

Another option might be for the VC to try and detect "real" syncing progress, but this seems fragile and likely to break. Or it could have a timeout, and start producing if it doesn't see any progress from the node after that timeout... also quite fragile.

Interested to hear your thoughts on this @paulhauner.

@paulhauner

This comment has been minimized.

Copy link
Member

@paulhauner paulhauner commented Dec 12, 2019

Oh yeah, this is interesting. Seems like a recurring case where things that are reasonable to do in testing are not reasonable for production.

My first thought is to add a --allow-unsynced flag (and perhaps set it by default whenever the testnet subcommand is used).

We do have the problem here were if mainnet gets more than 1 epoch behind we simply stop producing blocks (unless we run with the flag). I'm not sure I have the headspace to completely solve this problem at the moment, though.

I'd be tempted to add the flag for now, then make a more sophisticated solution later (e.g., one that can make some determination that it's had enough waiting).

That being said, if you feel like getting into this and solving it once-and-for-all, please don't let me stop you!

@AgeManning

This comment has been minimized.

Copy link
Member

@AgeManning AgeManning commented Dec 12, 2019

#720 May help with this.
The sync manager knows about whether it's synced or not, and it doesn't do it based on current block slot. It's based on our current blocks relative to all peers we know about. If all peers connected to us, also only have the current head_slot, we are considered synced.

In a single node situation, that node is considered stalled, and perhaps we want to produce blocks. Perhaps using the sync manager's definition of sync state solves this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

3 participants
You can’t perform that action at this time.