Validator cannot catchup in TdS with v0.23.6 #8406

Closed
kwunyeung opened this issue Feb 23, 2020 · 5 comments

kwunyeung commented Feb 23, 2020

Problem

Many validators cannot catch up in TdS after restarting with v0.23.6. The log keeps repeating lines like these:


Feb 23 10:02:58 solana-validator-eu solana-validator[5112]: [2020-02-23T10:02:58.180738984Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair_orphan repair-orphan=1533611i
Feb 23 10:02:58 solana-validator-eu solana-validator[5112]: [2020-02-23T10:02:58.180743288Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair_orphan repair-orphan=1534123i
Feb 23 10:02:58 solana-validator-eu solana-validator[5112]: [2020-02-23T10:02:58.180807848Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair_orphan repair-orphan=1535467i
Feb 23 10:02:58 solana-validator-eu solana-validator[5112]: [2020-02-23T10:02:58.180813023Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair_orphan repair-orphan=1536147i
Feb 23 10:02:58 solana-validator-eu solana-validator[5112]: [2020-02-23T10:02:58.180885828Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair_orphan repair-orphan=1536567i
Feb 23 10:02:58 solana-validator-eu solana-validator[5112]: [2020-02-23T10:02:58.282431063Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair_highest repair-highest-slot=1532964i repair-highest-ix=0i

The validator keeps falling further behind in slots:

⠄ Validator is 140572 slots away (us:1575619 them:1716191) and falling behind at -3.2 slots/second
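
For anyone scripting around this status line, here is a minimal sketch of pulling the numbers out of it. The regular expression is based only on the single line quoted above, so treat the pattern and the helper name `parse_catchup_status` as assumptions, not a description of the tool's full output format.

```python
import re

# Pattern inferred from the one status line quoted in this issue (assumption).
STATUS_RE = re.compile(
    r"Validator is (?P<reported>\d+) slots away "
    r"\(us:(?P<us>\d+) them:(?P<them>\d+)\) "
    r"and .+? at (?P<rate>-?\d+(?:\.\d+)?) slots/second"
)


def parse_catchup_status(line):
    """Return the slot numbers from a status line, or None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    us = int(match["us"])
    them = int(match["them"])
    return {
        "us": us,
        "them": them,
        "gap": them - us,                          # recomputed from the two slots
        "reported_gap": int(match["reported"]),    # as printed in the line
        "rate": float(match["rate"]),              # negative means falling behind
    }


status = parse_catchup_status(
    "Validator is 140572 slots away (us:1575619 them:1716191) "
    "and falling behind at -3.2 slots/second"
)
print(status)
```

For the line above, the recomputed gap (1716191 - 1575619) matches the reported 140572, and the negative rate means the gap is still growing.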

I have tried removing the ledger and restarting the node. The snapshot always starts at height 1532964.

Proposed Solution

mvines self-assigned this Feb 23, 2020
mvines added this to the Tofino v0.23.7 milestone Feb 23, 2020

mvines commented Feb 23, 2020

A full log of the solana-validator process output would be very helpful for debugging this.

mvines commented Feb 23, 2020

Ah, I'm able to reproduce this now. First my validator says:

solana_validator] Highest available snapshot slot is 1725568

and then it downloads a snapshot for a really old slot:

solana_ledger::snapshot_utils] Loaded bank for slot: 1532964
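
A quick way to check whether a node is hitting the same problem is to compare those two log lines. The sketch below assumes the log wording is exactly as quoted in this comment; the function name `snapshot_slot_mismatch` is made up for illustration.

```python
import re

# Both patterns are copied from the two log lines quoted in this comment.
ADVERTISED_RE = re.compile(r"Highest available snapshot slot is (\d+)")
LOADED_RE = re.compile(r"Loaded bank for slot: (\d+)")


def snapshot_slot_mismatch(log_lines):
    """Return advertised_slot - loaded_slot, or None if either line is missing.

    A large positive value means the node loaded a much older snapshot
    than the one it saw advertised.
    """
    advertised = None
    loaded = None
    for line in log_lines:
        match = ADVERTISED_RE.search(line)
        if match:
            advertised = int(match.group(1))
        match = LOADED_RE.search(line)
        if match:
            loaded = int(match.group(1))
    if advertised is None or loaded is None:
        return None
    return advertised - loaded


log = [
    "solana_validator] Highest available snapshot slot is 1725568",
    "solana_ledger::snapshot_utils] Loaded bank for slot: 1532964",
]
print(snapshot_slot_mismatch(log))
```

With the two lines from this comment the difference is 192604 slots, which is why the validator starts so far behind.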

mvines commented Feb 23, 2020

Some nodes, such as 206.189.221.221, appear not to be generating actual snapshot.tar.bz2 files even though they publish newer snapshot slots in gossip. We need to tighten the way a snapshot is fetched so that the requesting node asks for a specific snapshot slot and hash, and if it does not get exactly that back, it aborts and retries (perhaps blacklisting that node).
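
To make the idea concrete, here is a rough sketch of that fetch loop. It is not the validator's actual code (which is Rust) and not what any later fix necessarily does. Every name here (`fetch_verified_snapshot`, the `fetch` callable, the `(slot, hash)` tuple, `max_attempts`) is a hypothetical stand-in chosen for illustration.

```python
def fetch_verified_snapshot(candidates, expected, fetch, blacklist, max_attempts=3):
    """Try candidate nodes until one returns exactly the expected snapshot.

    candidates   -- iterable of node addresses advertising the snapshot
    expected     -- (slot, hash) pair the requester insists on
    fetch        -- hypothetical callable: fetch(node, expected) -> (slot, hash, data)
    blacklist    -- mutable set of nodes to skip; mismatching nodes are added to it
    max_attempts -- upper bound on download attempts in this call

    Returns (node, data) on success, or (None, None) if no node qualified.
    """
    attempts = 0
    for node in candidates:
        if node in blacklist:
            continue  # this node already served a wrong snapshot
        if attempts >= max_attempts:
            break
        attempts += 1
        try:
            slot, snapshot_hash, data = fetch(node, expected)
        except OSError:
            # Download failed outright; treat it like a bad node and move on.
            blacklist.add(node)
            continue
        if (slot, snapshot_hash) == tuple(expected):
            return node, data
        # Got something other than what was requested: abort and retry elsewhere.
        blacklist.add(node)
    return None, None
```

The key design point is that a response for any other slot or hash is discarded instead of being loaded, so a node that advertises a new slot but serves an old archive can no longer pin the requester to a stale starting point.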

mvines commented Feb 23, 2020

The temporary workaround for TdS stage 1 is to manually download a snapshot from the bootstrap leader:

$ rm -rf ledger/
$ mkdir ledger/
$ curl http://216.24.140.155:8899/snapshot.tar.bz2 --output ledger/snapshot.tar.bz2 

and then quickly start your validator with the --no-snapshot-fetch option to ensure that it uses the snapshot.tar.bz2 that was just downloaded instead of trying to re-fetch a new one.

mvines commented Feb 27, 2020

Fixed by #8482

mvines closed this as completed Feb 27, 2020