Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add local cluster tests that broadcast duplicate slots #13995

Merged
merged 3 commits into from
Jun 9, 2021

Conversation

jstarry
Copy link
Member

@jstarry jstarry commented Dec 7, 2020

Problem

Lacking local cluster tests for faulty nodes which sends duplicate shreds for complete blocks

Summary of Changes

  • Added new "duplicate node" local cluster test which runs two nodes with the same keys (one gets setup after the other)
  • Added a new broadcast stage for sending duplicate shreds:
    • Partitions the cluster based on stake
    • Does not send invalid entries (which would just cause the recipient to mark the slot as dead)
    • Wait to broadcast alternative version of a slot until all shreds can be sent at once (to avoid nodes getting duplicate shreds from repair)
    • Broadcasts duplicate shreds after a configurable slot delay

@jstarry jstarry added the noCI Suppress CI on this Pull Request label Dec 7, 2020
@jstarry
Copy link
Member Author

jstarry commented Dec 7, 2020

Test commands with log level

RUST_LOG=warn,solana_local_cluster=info RUST_BACKTRACE=1 cargo test --package solana-local-cluster --test local_cluster -- test_duplicate_node --exact --nocapture --ignored
RUST_LOG=warn,solana_local_cluster=info RUST_BACKTRACE=1 cargo test --package solana-local-cluster --test local_cluster -- test_fake_shreds_broadcast_leader --exact --nocapture --ignored

@jstarry
Copy link
Member Author

jstarry commented Dec 7, 2020

cc @carllin @sakridge

@carllin
Copy link
Contributor

carllin commented Dec 7, 2020

@jstarry thanks!

@stale
Copy link

stale bot commented Dec 24, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 24, 2020
@jstarry
Copy link
Member Author

jstarry commented Dec 29, 2020

@carllin we still want this to land after your changes right? If so, I'll keep it from getting going stale

@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 29, 2020
@carllin
Copy link
Contributor

carllin commented Dec 29, 2020

@jstarry yeah, there's a few more changes before this one can go in, but that's the goal!

@stale
Copy link

stale bot commented Jan 6, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 6, 2021
@jstarry jstarry removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 6, 2021
@stale
Copy link

stale bot commented Jan 14, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 14, 2021
@jstarry jstarry removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 14, 2021
@stale
Copy link

stale bot commented Jan 23, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 23, 2021
@jstarry jstarry removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 23, 2021
@stale
Copy link

stale bot commented Feb 7, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Feb 7, 2021
@jstarry jstarry removed the stale [bot only] Added to stale content; results in auto-close after a week. label Feb 8, 2021
@stale
Copy link

stale bot commented Feb 16, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Feb 16, 2021
@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 15, 2021
@jstarry jstarry marked this pull request as ready for review March 22, 2021 06:57
@jstarry jstarry requested a review from carllin March 22, 2021 08:11
@carllin
Copy link
Contributor

carllin commented Mar 22, 2021

Found one more edge case, fix here: carllin@e5cacb1

Broadcasts need to handle interrupted slots, otherwise they'll start from the shred index of the previous interrupted block

@jstarry jstarry added the CI Pull Request is ready to enter CI label Mar 23, 2021
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Mar 23, 2021
@stale
Copy link

stale bot commented Apr 19, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 19, 2021
@stale
Copy link

stale bot commented Jun 2, 2021

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Jun 2, 2021
@carllin carllin reopened this Jun 2, 2021
@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 2, 2021
@jstarry jstarry force-pushed the dupe-block-tests branch 3 times, most recently from 9266c31 to 1d429d1 Compare June 3, 2021 01:08
@codecov
Copy link

codecov bot commented Jun 6, 2021

Codecov Report

Merging #13995 (e7338b1) into master (8e1e57a) will decrease coverage by 0.1%.
The diff coverage is 0.5%.

@@            Coverage Diff            @@
##           master   #13995     +/-   ##
=========================================
- Coverage    82.7%    82.6%   -0.2%     
=========================================
  Files         430      431      +1     
  Lines      120480   120674    +194     
=========================================
- Hits        99734    99728      -6     
- Misses      20746    20946    +200     

@carllin
Copy link
Contributor

carllin commented Jun 7, 2021

Tagging @AshwinSekar, these are some tests, specifically test_duplicate_shreds_broadcast_leader that tests a leader broadcasting two versions of the same block. Might be useful to integrate this into a test as well across multiple nodes (can spin up a single bad leader with the custom broadcast stage BroadcastStageType::BroadcastDuplicates I think).

@carllin carllin requested a review from AshwinSekar June 7, 2021 22:11
let data_shreds = Arc::new(data_shreds);
blockstore_sender.send((data_shreds.clone(), None))?;

// 3) Start broadcast step
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, why do we send out the duplicate shreds first before the real recipients?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's arbitrary. They should be received around the same time. The key part here is just that two sets of nodes get different versions of the same block and they are completely unaware that another version exists. Later, we send out a shred that belongs to the duplicate version so that nodes learn of the existence of a duplicate and start acting to fix the forking.

@jstarry jstarry added the v1.7 label Jun 9, 2021
@jstarry jstarry merged commit 050bb54 into solana-labs:master Jun 9, 2021
@jstarry jstarry deleted the dupe-block-tests branch June 9, 2021 22:03
mergify bot pushed a commit that referenced this pull request Jun 9, 2021
* Add duplicate node local cluster test

* fix clippy

* remove dupe test

(cherry picked from commit 050bb54)
mergify bot added a commit that referenced this pull request Jun 10, 2021
* Add duplicate node local cluster test

* fix clippy

* remove dupe test

(cherry picked from commit 050bb54)

Co-authored-by: Justin Starry <justin@solana.com>
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants