Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extends crds values timeouts if stakes are unknown #17261

Merged
merged 3 commits into from
May 21, 2021

Conversation

behzadnouri
Copy link
Contributor

Problem

If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.

Summary of Changes

extend crds values timeouts if stakes are unknown

@codecov
Copy link

codecov bot commented May 16, 2021

Codecov Report

Merging #17261 (a17e33f) into master (6cba534) will increase coverage by 0.0%.
The diff coverage is 93.0%.

@@           Coverage Diff            @@
##           master   #17261    +/-   ##
========================================
  Coverage    82.7%    82.7%            
========================================
  Files         421      424     +3     
  Lines      118048   118440   +392     
========================================
+ Hits        97664    98012   +348     
- Misses      20384    20428    +44     

@behzadnouri behzadnouri marked this pull request as ready for review May 16, 2021 18:51
@behzadnouri behzadnouri force-pushed the no-stake-timeouts branch 8 times, most recently from 840e6cf to 8c1db29 Compare May 19, 2021 21:27
) -> HashMap<Pubkey, u64> {
self.make_timeouts_def(self_id, stakes, epoch_ms, self.crds_timeout)
let extended_timeout = self.crds_timeout.max(epoch_duration.as_millis() as u64);
let default_timeout = if stakes.values().all(|stake| *stake == 0) {
Copy link
Contributor

@carllin carllin May 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm so the only time we default to the longer epoch timeout for unknown keys, i.e. the (Pubkey::default(), default_timeout) entry below, is when all the stakes are zero. From reading the PR description, I'm assuming this is supposed to be only at startup?

From what I'm reading, however, some of the stakes should always be nonzero for a validator on a real network because:

  1. We always pass a BankForks here https://github.com/solana-labs/solana/blob/master/core/src/validator.rs#L600

  2. Which passes into GossipService: https://github.com/solana-labs/solana/blob/master/core/src/gossip_service.rs#L66

  3. Which means there should be some stake in this staked_nodes https://github.com/solana-labs/solana/blob/master/core/src/cluster_info.rs#L1848 which is then passed here to make_timeouts()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, to adopt the shred version from the entrypoint, the validator starts a gossip service with no bank-forks:
https://github.com/solana-labs/solana/blob/ed51cde37/validator/src/main.rs#L369

That is the primary motivation for this change, because then the validator assigns short timeouts to all crds values, and when it gets a response from the entrypoint, most often it discards it right away. (This exacerbates the start-up wait and bandwidth usage because the values previously fetched are not excluded from the next pull request to the entrypoint and they are likely redundantly sent again).

But, besides that, I see non-zero values for cluster_info-purge-no_working_bank metric on mainnet, which then indicates that the stakes were unknown when computing timeouts:
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/cluster_info.rs#L1750
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/cluster_info.rs#L2528

So besides starting validator, I think for an already running validator we are currently hitting cases where stakes are unknown. (Presumably because of a no working bank?!)

Please note that cluster_info-purge-no_working_bank is kind of a misnomer. There are 2 instances of the matric in the code and one is actually using the root bank. This change is updates both places to always consistently use the root-bank.

Copy link
Contributor

@carllin carllin May 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation! Yeah it seems in both cases cluster_info-purge-no_working_bank is triggered when there's no BankForks, so I would assume its the same case you described above the validator starts a gossip service with no bank-forks:

carllin
carllin previously approved these changes May 20, 2021
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
@mergify mergify bot dismissed carllin’s stale review May 21, 2021 14:19

Pull request has been modified.

@behzadnouri behzadnouri merged commit 2adce67 into solana-labs:master May 21, 2021
@behzadnouri behzadnouri deleted the no-stake-timeouts branch May 21, 2021 15:55
mergify bot pushed a commit that referenced this pull request May 21, 2021
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.

(cherry picked from commit 2adce67)
mergify bot added a commit that referenced this pull request May 21, 2021
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.

(cherry picked from commit 2adce67)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request May 22, 2021
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request May 24, 2021
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request May 24, 2021
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request May 24, 2021
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request May 24, 2021
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants