Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implements an unbiased weighted shuffle using binary indexed tree #18343

Merged

Conversation

behzadnouri
Copy link
Contributor

@behzadnouri behzadnouri commented Jun 30, 2021

Problem

Current implementation of weighted_shuffle:
https://github.com/solana-labs/solana/blob/b08f8bd1b/gossip/src/weighted_shuffle.rs#L11-L37
uses a heuristic which results in biased samples.

For example, if the weights are [1, 10, 100], then the 3rd index should
come first 100 times more often than the 1st index. However,
weighted_shuffle is picking the 3rd index 200+ times more often than the
1st index, showing a disproportional bias in favor of higher weights.

Summary of Changes

This commit implements weighted shuffle using binary indexed tree to
maintain cumulative sum of weights while sampling. The resulting samples
are demonstrably unbiased and precisely proportional to the weights.

Additionally the iterator interface allows to skip computations when
not all indices are processed.

Of the use cases of weighted_shuffle, changing turbine code requires
feature-gating to keep the cluster in sync. That is not updated in
this commit, but can be done together with future updates to turbine.

@behzadnouri behzadnouri force-pushed the weighted-shuffle-fenwick branch 3 times, most recently from f866ecd to 856bfca Compare July 1, 2021 14:09
@codecov
Copy link

codecov bot commented Jul 1, 2021

Codecov Report

Merging #18343 (d912b34) into master (0426c2d) will increase coverage by 0.0%.
The diff coverage is 97.4%.

@@           Coverage Diff            @@
##           master   #18343    +/-   ##
========================================
  Coverage    82.5%    82.6%            
========================================
  Files         436      436            
  Lines      121801   121903   +102     
========================================
+ Hits       100602   100693    +91     
- Misses      21199    21210    +11     

@behzadnouri behzadnouri marked this pull request as ready for review July 1, 2021 16:26
Current implementation of weighted_shuffle:
https://github.com/solana-labs/solana/blob/b08f8bd1b/gossip/src/weighted_shuffle.rs#L11-L37
uses a heuristic which results in biased samples.

For example, if the weights are [1, 10, 100], then the 3rd index should
come first 100 times more often than the 1st index. However,
weighted_shuffle is picking the 3rd index 200+ times more often than the
1st index, showing a disproportional bias in favor of higher weights.

This commit implements weighted shuffle using binary indexed tree to
maintain cumulative sum of weights while sampling. The resulting samples
are demonstrably unbiased and precisely proportional to the weights.

Additionally the iterator interface allows to skip computations when
not all indices are processed.

Of the use cases of weighted_shuffle, changing turbine code requires
feature-gating to keep the cluster in sync. That is not updated in
this commit, but can be done together with future updates to turbine.
Comment on lines 304 to +309
let since = (now.saturating_sub(req_time).min(3600 * 1000) / 1024) as u32;
let stake = get_stake(&item.id, stakes);
let weight = get_weight(max_weight, since, stake);
(weight, item)
// Weights are bounded by max_weight defined above.
// So this type-cast should be safe.
((weight * 100.0) as u64, item)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a function for this calculation to use here and in crds_gossip_push.rs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think might be good to refactor push_options and pull_options to share the common parts. But I would rather not to mix that refactoring with this pull-request.

Copy link
Contributor

@jbiseda jbiseda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@behzadnouri behzadnouri merged commit dba42c5 into solana-labs:master Jul 7, 2021
@behzadnouri behzadnouri deleted the weighted-shuffle-fenwick branch July 7, 2021 14:14
mergify bot pushed a commit that referenced this pull request Jul 7, 2021
…8343)

Current implementation of weighted_shuffle:
https://github.com/solana-labs/solana/blob/b08f8bd1b/gossip/src/weighted_shuffle.rs#L11-L37
uses a heuristic which results in biased samples.

For example, if the weights are [1, 10, 100], then the 3rd index should
come first 100 times more often than the 1st index. However,
weighted_shuffle is picking the 3rd index 200+ times more often than the
1st index, showing a disproportional bias in favor of higher weights.

This commit implements weighted shuffle using binary indexed tree to
maintain cumulative sum of weights while sampling. The resulting samples
are demonstrably unbiased and precisely proportional to the weights.

Additionally the iterator interface allows to skip computations when
not all indices are processed.

Of the use cases of weighted_shuffle, changing turbine code requires
feature-gating to keep the cluster in sync. That is not updated in
this commit, but can be done together with future updates to turbine.

(cherry picked from commit dba42c5)
mergify bot added a commit that referenced this pull request Jul 7, 2021
…8343) (#18485)

Current implementation of weighted_shuffle:
https://github.com/solana-labs/solana/blob/b08f8bd1b/gossip/src/weighted_shuffle.rs#L11-L37
uses a heuristic which results in biased samples.

For example, if the weights are [1, 10, 100], then the 3rd index should
come first 100 times more often than the 1st index. However,
weighted_shuffle is picking the 3rd index 200+ times more often than the
1st index, showing a disproportional bias in favor of higher weights.

This commit implements weighted shuffle using binary indexed tree to
maintain cumulative sum of weights while sampling. The resulting samples
are demonstrably unbiased and precisely proportional to the weights.

Additionally the iterator interface allows to skip computations when
not all indices are processed.

Of the use cases of weighted_shuffle, changing turbine code requires
feature-gating to keep the cluster in sync. That is not updated in
this commit, but can be done together with future updates to turbine.

(cherry picked from commit dba42c5)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021
behzadnouri added a commit to behzadnouri/solana that referenced this pull request Apr 5, 2022
Older weighted_shuffle is based on a heuristic which results in biased
samples as shown in:
solana-labs#18343
and can be replaced with WeightedShuffle.

Also, as described in:
solana-labs#13919
weighted_best can be replaced with rand::distributions::WeightedIndex,
or WeightdShuffle::first.
behzadnouri added a commit to behzadnouri/solana that referenced this pull request Apr 5, 2022
Older weighted_shuffle is based on a heuristic which results in biased
samples as shown in:
solana-labs#18343
and can be replaced with WeightedShuffle.

Also, as described in:
solana-labs#13919
weighted_best can be replaced with rand::distributions::WeightedIndex,
or WeightdShuffle::first.
behzadnouri added a commit that referenced this pull request Apr 5, 2022
Older weighted_shuffle is based on a heuristic which results in biased
samples as shown in:
#18343
and can be replaced with WeightedShuffle.

Also, as described in:
#13919
weighted_best can be replaced with rand::distributions::WeightedIndex,
or WeightdShuffle::first.
mergify bot pushed a commit that referenced this pull request Apr 6, 2022
Older weighted_shuffle is based on a heuristic which results in biased
samples as shown in:
#18343
and can be replaced with WeightedShuffle.

Also, as described in:
#13919
weighted_best can be replaced with rand::distributions::WeightedIndex,
or WeightdShuffle::first.

(cherry picked from commit db23295)
behzadnouri added a commit that referenced this pull request Apr 7, 2022
…24139)

Older weighted_shuffle is based on a heuristic which results in biased
samples as shown in:
#18343
and can be replaced with WeightedShuffle.

Also, as described in:
#13919
weighted_best can be replaced with rand::distributions::WeightedIndex,
or WeightdShuffle::first.

(cherry picked from commit db23295)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants