-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
extends crds values timeouts if stakes are unknown #17261
extends crds values timeouts if stakes are unknown #17261
Conversation
Codecov Report
@@ Coverage Diff @@
## master #17261 +/- ##
========================================
Coverage 82.7% 82.7%
========================================
Files 421 424 +3
Lines 118048 118440 +392
========================================
+ Hits 97664 98012 +348
- Misses 20384 20428 +44 |
840e6cf
to
8c1db29
Compare
) -> HashMap<Pubkey, u64> { | ||
self.make_timeouts_def(self_id, stakes, epoch_ms, self.crds_timeout) | ||
let extended_timeout = self.crds_timeout.max(epoch_duration.as_millis() as u64); | ||
let default_timeout = if stakes.values().all(|stake| *stake == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm so the only time we default to the longer epoch timeout for unknown keys, i.e. the (Pubkey::default(), default_timeout)
entry below, is when all the stakes are zero. From reading the PR description, I'm assuming this is supposed to be only at startup?
From what I'm reading, however, some of the stakes should always be nonzero for a validator on a real network because:
-
We always pass a
BankForks
here https://github.com/solana-labs/solana/blob/master/core/src/validator.rs#L600 -
Which passes into GossipService: https://github.com/solana-labs/solana/blob/master/core/src/gossip_service.rs#L66
-
Which means there should be some stake in this
staked_nodes
https://github.com/solana-labs/solana/blob/master/core/src/cluster_info.rs#L1848 which is then passed here tomake_timeouts()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, to adopt the shred version from the entrypoint, the validator starts a gossip service with no bank-forks:
https://github.com/solana-labs/solana/blob/ed51cde37/validator/src/main.rs#L369
That is the primary motivation for this change, because then the validator assigns short timeouts to all crds values, and when it gets a response from the entrypoint, most often it discards it right away. (This exacerbates the start-up wait and bandwidth usage because the values previously fetched are not excluded from the next pull request to the entrypoint and they are likely redundantly sent again).
But, besides that, I see non-zero values for cluster_info-purge-no_working_bank
metric on mainnet, which then indicates that the stakes were unknown when computing timeouts:
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/cluster_info.rs#L1750
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/cluster_info.rs#L2528
So besides starting validator, I think for an already running validator we are currently hitting cases where stakes are unknown. (Presumably because of a no working bank?!)
Please note that cluster_info-purge-no_working_bank
is kind of a misnomer. There are 2 instances of the matric in the code and one is actually using the root bank. This change is updates both places to always consistently use the root-bank.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the explanation! Yeah it seems in both cases cluster_info-purge-no_working_bank
is triggered when there's no BankForks
, so I would assume its the same case you described above the validator starts a gossip service with no bank-forks:
If stakes are unknown, then timeouts will be short, resulting in values being purged from the crds table, and consequently higher pull-response load when they are obtained again from gossip. In particular, this slows down validator start where almost all values obtained from entrypoint are immediately discarded.
8c1db29
to
a17e33f
Compare
If stakes are unknown, then timeouts will be short, resulting in values being purged from the crds table, and consequently higher pull-response load when they are obtained again from gossip. In particular, this slows down validator start where almost all values obtained from entrypoint are immediately discarded. (cherry picked from commit 2adce67)
If stakes are unknown, then timeouts will be short, resulting in values being purged from the crds table, and consequently higher pull-response load when they are obtained again from gossip. In particular, this slows down validator start where almost all values obtained from entrypoint are immediately discarded. (cherry picked from commit 2adce67) Co-authored-by: behzad nouri <behzadnouri@gmail.com>
If stakes are unknown, then timeouts will be short, resulting in values being purged from the crds table, and consequently higher pull-response load when they are obtained again from gossip. In particular, this slows down validator start where almost all values obtained from entrypoint are immediately discarded.
If stakes are unknown, then timeouts will be short, resulting in values being purged from the crds table, and consequently higher pull-response load when they are obtained again from gossip. In particular, this slows down validator start where almost all values obtained from entrypoint are immediately discarded.
If stakes are unknown, then timeouts will be short, resulting in values being purged from the crds table, and consequently higher pull-response load when they are obtained again from gossip. In particular, this slows down validator start where almost all values obtained from entrypoint are immediately discarded.
If stakes are unknown, then timeouts will be short, resulting in values being purged from the crds table, and consequently higher pull-response load when they are obtained again from gossip. In particular, this slows down validator start where almost all values obtained from entrypoint are immediately discarded.
If stakes are unknown, then timeouts will be short, resulting in values being purged from the crds table, and consequently higher pull-response load when they are obtained again from gossip. In particular, this slows down validator start where almost all values obtained from entrypoint are immediately discarded.
Problem
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
Summary of Changes
extend crds values timeouts if stakes are unknown