Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bumps up min number of bloom items in gossip pull requests #17236

Merged
merged 1 commit into from
May 21, 2021

Conversation

behzadnouri
Copy link
Contributor

@behzadnouri behzadnouri commented May 14, 2021

Problem

When a validator starts, it has an (almost) empty crds table and it only sends one pull-request to the entrypoint. The bloom filter in the pull-request targets 10% false rate given the number of items. So, if the num_items is very wrong, it makes a very small bloom filter with a very high false rate:
https://github.com/solana-labs/solana/blob/2ae57c172/runtime/src/bloom.rs#L70-L80
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/crds_gossip_pull.rs#L48

As a result, it is very unlikely that the validator obtains entrypoint's contact-info in response. This exacerbates how long the validator will loop on:

Waiting to adopt entrypoint shred version

https://github.com/solana-labs/solana/blob/ed51cde37/validator/src/main.rs#L390-L412

Summary of Changes

This commit increases the min number of bloom items when making gossip pull requests. Effectively this will break the entrypoint crds table into 64 shards, one pull-request for each, a larger bloom filter for each shard, and increases the chances that the response will include entrypoint's contact-info, which is needed for adopting shred version and validator start.

@codecov
Copy link

codecov bot commented May 14, 2021

Codecov Report

Merging #17236 (2345372) into master (2ae57c1) will decrease coverage by 0.0%.
The diff coverage is 98.2%.

@@            Coverage Diff            @@
##           master   #17236     +/-   ##
=========================================
- Coverage    82.7%    82.7%   -0.1%     
=========================================
  Files         423      423             
  Lines      118112   118114      +2     
=========================================
+ Hits        97687    97688      +1     
- Misses      20425    20426      +1     

@behzadnouri behzadnouri marked this pull request as ready for review May 15, 2021 17:09
@behzadnouri behzadnouri force-pushed the inspect-entrypoint branch 3 times, most recently from 2f47b91 to da90fc5 Compare May 19, 2021 22:40
@carllin
Copy link
Contributor

carllin commented May 20, 2021

Hmmm, so if there are no entries in the Crds table, then only one filter will be built in the CrdsFilterSet with mask_bits = 0. Then this gets made into a CrdsFilterhere withmask_bits = 0andmask = !0x0`.

As a result, it is very unlikely that it obtains entrypoint's contact-info in response

My question is, when the entrypoint gets this message, do they not respond with everything in their table because this mask with mask_bits=0 shouldn't filter out anything?

@behzadnouri
Copy link
Contributor Author

My question is, when the entrypoint gets this message, do they not respond with everything in their table because this mask with mask_bits=0 shouldn't filter out anything?

So, the mask_bits effectively only breaks the entrypoint's crds table into a number of shards (and with mask_bits = 0 there will be only one shard for the entire table). Then the bloom filters are checked to filter out what values the validator already has:
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/crds_gossip_pull.rs#L525-L537

The bloom filter target 10% false rate given the number of items. So, if the num_items is very wrong, it makes a very small bloom filter with a very high false rate:
https://github.com/solana-labs/solana/blob/2ae57c172/runtime/src/bloom.rs#L70-L80
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/crds_gossip_pull.rs#L48

The validator always have a couple of items in its crds table (its own contact-info, version, node-instance). So the bloom filter always have some bits set causing the false collisions.

I realize the description of this pull-request is not good; I will update the description as well.

When a validator starts, it has an (almost) empty crds table and it only
sends one pull-request to the entrypoint. The bloom filter in the
pull-request targets 10% false rate given the number of items. So, if
the `num_items` is very wrong, it makes a very small bloom filter with a
very high false rate:
https://github.com/solana-labs/solana/blob/2ae57c172/runtime/src/bloom.rs#L70-L80
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/crds_gossip_pull.rs#L48

As a result, it is very unlikely that the validator obtains entrypoint's
contact-info in response. This exacerbates how long the validator will
loop on:
    > Waiting to adopt entrypoint shred version
https://github.com/solana-labs/solana/blob/ed51cde37/validator/src/main.rs#L390-L412

This commit increases the min number of bloom items when making gossip
pull requests. Effectively this will break the entrypoint crds table
into 64 shards, one pull-request for each, a larger bloom filter for
each shard, and increases the chances that the response will include
entrypoint's contact-info, which is needed for adopting shred version
and validator start.
@carllin
Copy link
Contributor

carllin commented May 20, 2021

@behzadnouri Ah I see, the bloom is built for a 10% false rate given very little expected default capacity, so here we are bumping the expected default capacity, thanks for the explanation!

crds.len() + self.purged_values.len() + self.failed_inserts.len(),
);
let filters = CrdsFilterSet::new(num, bloom_size);
const MIN_NUM_BLOOM_ITEMS: usize = 65_536;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be a moving target as the number of validators on the network increases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will see, but probably with #17261 we should be fine

Comment on lines +297 to +301
let bloom = AtomicBloom::from(Bloom::random(
num_bloom_items,
BLOOM_FALSE_RATE,
BLOOM_MAX_BITS,
));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for cleaning this up!

@behzadnouri behzadnouri merged commit e8b35a4 into solana-labs:master May 21, 2021
@behzadnouri behzadnouri deleted the inspect-entrypoint branch May 21, 2021 13:59
mergify bot pushed a commit that referenced this pull request May 21, 2021
When a validator starts, it has an (almost) empty crds table and it only
sends one pull-request to the entrypoint. The bloom filter in the
pull-request targets 10% false rate given the number of items. So, if
the `num_items` is very wrong, it makes a very small bloom filter with a
very high false rate:
https://github.com/solana-labs/solana/blob/2ae57c172/runtime/src/bloom.rs#L70-L80
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/crds_gossip_pull.rs#L48

As a result, it is very unlikely that the validator obtains entrypoint's
contact-info in response. This exacerbates how long the validator will
loop on:
    > Waiting to adopt entrypoint shred version
https://github.com/solana-labs/solana/blob/ed51cde37/validator/src/main.rs#L390-L412

This commit increases the min number of bloom items when making gossip
pull requests. Effectively this will break the entrypoint crds table
into 64 shards, one pull-request for each, a larger bloom filter for
each shard, and increases the chances that the response will include
entrypoint's contact-info, which is needed for adopting shred version
and validator start.

(cherry picked from commit e8b35a4)
mergify bot added a commit that referenced this pull request May 21, 2021
…17383)

When a validator starts, it has an (almost) empty crds table and it only
sends one pull-request to the entrypoint. The bloom filter in the
pull-request targets 10% false rate given the number of items. So, if
the `num_items` is very wrong, it makes a very small bloom filter with a
very high false rate:
https://github.com/solana-labs/solana/blob/2ae57c172/runtime/src/bloom.rs#L70-L80
https://github.com/solana-labs/solana/blob/2ae57c172/core/src/crds_gossip_pull.rs#L48

As a result, it is very unlikely that the validator obtains entrypoint's
contact-info in response. This exacerbates how long the validator will
loop on:
    > Waiting to adopt entrypoint shred version
https://github.com/solana-labs/solana/blob/ed51cde37/validator/src/main.rs#L390-L412

This commit increases the min number of bloom items when making gossip
pull requests. Effectively this will break the entrypoint crds table
into 64 shards, one pull-request for each, a larger bloom filter for
each shard, and increases the chances that the response will include
entrypoint's contact-info, which is needed for adopting shred version
and validator start.

(cherry picked from commit e8b35a4)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
behzadnouri added a commit to behzadnouri/solana that referenced this pull request May 23, 2021
coverage ci builds are failing presumably because of the overhead added
in: solana-labs#17236
for very small test cluster.

This commit uses smaller min number of bloom items condition on that if
debug assertions are enabled or not.
behzadnouri added a commit to behzadnouri/solana that referenced this pull request May 23, 2021
coverage ci builds are failing presumably because of the overhead added
in: solana-labs#17236
for very small test cluster.

Previous attempt at fixing the flakiness:
solana-labs#17408

This commit uses smaller min number of bloom items condition on that if
debug assertions are enabled or not.
behzadnouri added a commit to behzadnouri/solana that referenced this pull request May 23, 2021
coverage ci builds are have become flaky presumably because of the
overhead added in solana-labs#17236
for very small test clusters.

This commit uses a smaller min number of bloom items condition on that
if debug assertions are enabled or not.

Previous attempt at fixing the flakiness:
solana-labs#17408
behzadnouri added a commit that referenced this pull request May 23, 2021
coverage ci builds are have become flaky presumably because of the
overhead added in #17236
for very small test clusters.

This commit uses a smaller min number of bloom items condition on that
if debug assertions are enabled or not.

Previous attempt at fixing the flakiness:
#17408
mergify bot pushed a commit that referenced this pull request May 23, 2021
coverage ci builds are have become flaky presumably because of the
overhead added in #17236
for very small test clusters.

This commit uses a smaller min number of bloom items condition on that
if debug assertions are enabled or not.

Previous attempt at fixing the flakiness:
#17408

(cherry picked from commit 5567305)
mergify bot added a commit that referenced this pull request May 23, 2021
coverage ci builds are have become flaky presumably because of the
overhead added in #17236
for very small test clusters.

This commit uses a smaller min number of bloom items condition on that
if debug assertions are enabled or not.

Previous attempt at fixing the flakiness:
#17408

(cherry picked from commit 5567305)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants