Setup ReplayStage confirmation scaffolding for duplicate slots #9698

Merged
carllin merged 16 commits into solana-labs:master from carllin:FixReplayStage
Mar 25, 2021

Conversation

carllin
Contributor

@carllin carllin commented Apr 24, 2020

Problem

Duplicate versions of a slot can partition and stall the cluster.

Summary of Changes

Here's how I imagine the full fix working:

What's in this PR (goal is to prevent voting on duplicate slots asap):

  1. Duplicate slots are detected in WindowService and sent to ReplayStage through a channel. ReplayStage processes them here: https://github.com/solana-labs/solana/compare/master...carllin:FixReplayStage?expand=1#diff-5984b6b0429f857c13a2a362669ab37cR533. Two cases:
  1. In 1b) after a fork has been marked duplicate, we remove it as a candidate for voting here:
    https://github.com/solana-labs/solana/compare/master...carllin:FixReplayStage?expand=1#diff-5984b6b0429f857c13a2a362669ab37cR1326-R1331. However, if you've already voted on this fork and can't generate a switching proof, you will continue to generate banks on this fork
    to avoid liveness issues (more details here: https://github.com/solana-labs/solana/compare/master...carllin:FixReplayStage?expand=1#diff-5984b6b0429f857c13a2a362669ab37cR1271-R1278)

  2. If a duplicate slot is confirmed, the entire fork is added back into the candidate set by clearing the duplicate flag here: https://github.com/solana-labs/solana/compare/master...carllin:FixReplayStage?expand=1#diff-fb925e9cb1a6c2044ceaa55aa7c8f255R393
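
To make the flow above concrete, here is a minimal sketch of the bookkeeping. It is not the code in this PR: `ForkState`, `process_duplicates`, `is_vote_candidate`, and `mark_confirmed` are hypothetical names, and a plain `std::sync::mpsc` channel stands in for the WindowService to ReplayStage channel.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::Receiver;

type Slot = u64;

/// Hypothetical stand-in for the fork bookkeeping kept by ReplayStage.
#[derive(Default)]
struct ForkState {
    /// Maps each slot to its parent slot; the root has no entry.
    parents: HashMap<Slot, Slot>,
    /// Slots flagged as duplicate that have not been confirmed.
    duplicate: HashSet<Slot>,
}

impl ForkState {
    /// Record a new slot. Parents always precede children, so the
    /// ancestor walk in `is_vote_candidate` terminates.
    fn add_slot(&mut self, slot: Slot, parent: Slot) {
        assert!(parent < slot, "parent slot must precede child slot");
        self.parents.insert(slot, parent);
    }

    /// Drain every pending duplicate-slot notification without blocking.
    fn process_duplicates(&mut self, receiver: &Receiver<Slot>) {
        for slot in receiver.try_iter() {
            self.duplicate.insert(slot);
        }
    }

    /// A slot is a vote candidate only if neither it nor any of its
    /// ancestors is currently flagged as duplicate.
    fn is_vote_candidate(&self, slot: Slot) -> bool {
        let mut current = Some(slot);
        while let Some(s) = current {
            if self.duplicate.contains(&s) {
                return false;
            }
            current = self.parents.get(&s).copied();
        }
        true
    }

    /// Clearing the flag re-admits the slot and all of its descendants,
    /// because descendants are filtered only through the ancestor walk.
    fn mark_confirmed(&mut self, slot: Slot) {
        self.duplicate.remove(&slot);
    }
}
```

In this sketch, flagging slot 2 excludes slot 2 and any child built on it, while a sibling fork stays votable; calling `mark_confirmed(2)` restores the whole fork in one step. The "already voted and no switching proof" case is deliberately left out.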

What's in follow-up PR:

  1. Need to detect whether the cluster has confirmed some alternate version V of a duplicate slot (imagine a validator has a dead slot, or a valid, playable version of a slot, but the rest of the cluster confirmed a different slot). Can this be done by having another version of EpochSlots, but for confirmed slots instead of completed slots?

  2. If you see that a slot is confirmed by a supermajority in 1) and your version of the slot has been dead or unconfirmed for some expiration time, then dump your version of the slot and download another version from a trusted validator OR a random stake-weighted validator who claims to have a confirmed version of the block. For v2 we can ask for a proof, using an RSA accumulator, that this version of the slot is the one included in a future confirmed block.

  3. When a confirmed version of a duplicate slot with hash V is found, and it's not the same as your currently played version:
    a) Clear the currently played version from
      i) status cache: https://github.com/solana-labs/solana/compare/master...carllin:FixReplayStage?expand=1#diff-92c739d9ad61135b886d1a44957fe485R83
      ii) and progress map in replay stage: https://github.com/solana-labs/solana/compare/master...carllin:FixReplayStage?expand=1#diff-5984b6b0429f857c13a2a362669ab37cR486
    b) Set the confirmed blockhash in blockstore.
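
As a rough illustration of follow-up steps 2 and 3, the sketch below separates the decision (keep, wait, or dump and repair) from the state cleanup. Everything here is an assumption for illustration: `LocalVersion`, `Action`, `reconcile`, `LocalState`, and `replace_with_confirmed` are hypothetical names, the maps are simple stand-ins for the status cache, progress map, and blockstore, and the expiration value is left to the caller because the description does not fix one.

```rust
use std::collections::HashMap;
use std::time::Duration;

type Slot = u64;
type Hash = [u8; 32];

/// What this validator currently holds for a slot (hypothetical).
#[derive(Debug, PartialEq)]
enum LocalVersion {
    Dead,
    Replayed(Hash),
}

#[derive(Debug, PartialEq)]
enum Action {
    Keep,
    Wait,
    DumpAndRepair,
}

/// Step 2: dump the local version only when the cluster has confirmed a
/// different version and the expiration time has elapsed.
fn reconcile(
    local: &LocalVersion,
    cluster_confirmed: Option<Hash>,
    waited: Duration,
    expiration: Duration,
) -> Action {
    match cluster_confirmed {
        // Nothing confirmed by the cluster yet: no reason to act.
        None => Action::Keep,
        // Our replayed version is the confirmed one.
        Some(v) if *local == LocalVersion::Replayed(v) => Action::Keep,
        // Dead or mismatched version: act once the grace period is over.
        Some(_) if waited >= expiration => Action::DumpAndRepair,
        Some(_) => Action::Wait,
    }
}

/// Simplified stand-ins for the status cache, progress map, and blockstore.
#[derive(Default)]
struct LocalState {
    status_cache: HashMap<Slot, Hash>,
    progress: HashMap<Slot, LocalVersion>,
    confirmed_hashes: HashMap<Slot, Hash>,
}

impl LocalState {
    /// Step 3: if the confirmed hash differs from the played version,
    /// clear the played version and record the confirmed blockhash.
    /// Returns true when local state was replaced.
    fn replace_with_confirmed(&mut self, slot: Slot, confirmed: Hash) -> bool {
        let already_confirmed = matches!(
            self.progress.get(&slot),
            Some(LocalVersion::Replayed(h)) if *h == confirmed
        );
        if already_confirmed {
            return false;
        }
        self.status_cache.remove(&slot); // 3a-i
        self.progress.remove(&slot); // 3a-ii
        self.confirmed_hashes.insert(slot, confirmed); // 3b
        true
    }
}
```

Keeping `reconcile` a pure function makes the timeout policy easy to test on its own; fetching the replacement block from a trusted or stake-weighted peer is out of scope for this sketch.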

Fixes #

@carllin carllin force-pushed the FixReplayStage branch 2 times, most recently from 7fd3b29 to 50204e9 Compare April 25, 2020 02:45
@codecov

codecov bot commented Apr 25, 2020

Codecov Report

Merging #9698 (5b78218) into master (6271665) will increase coverage by 0.1%.
The diff coverage is 92.7%.

@@            Coverage Diff            @@
##           master    #9698     +/-   ##
=========================================
+ Coverage    79.9%    80.0%   +0.1%     
=========================================
  Files         409      410      +1     
  Lines      107768   108588    +820     
=========================================
+ Hits        86179    86950    +771     
- Misses      21589    21638     +49     

@stale

stale bot commented May 5, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label May 5, 2020
@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label May 6, 2020
@stale

stale bot commented May 16, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label May 16, 2020
@mvines
Member

mvines commented May 22, 2020

@carllin - sup with this PR? Is it ded?

@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label May 22, 2020
@carllin
Contributor Author

carllin commented May 22, 2020

@mvines it's on hold because I haven't had time to get it in yet 😢

@stale

stale bot commented May 29, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label May 29, 2020
@stale

stale bot commented Jun 5, 2020

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Jun 5, 2020
@aeyakovenko aeyakovenko reopened this Dec 7, 2020
@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 7, 2020
@carllin carllin force-pushed the FixReplayStage branch 6 times, most recently from 5dee535 to 9a2c6a2 Compare December 13, 2020 01:14
@carllin carllin force-pushed the FixReplayStage branch 2 times, most recently from 369f099 to 888a077 Compare December 16, 2020 10:18
@carllin
Contributor Author

carllin commented Mar 23, 2021

> want to add a design doc as well?

Yup, will update the existing docs soon.

@mergify mergify bot dismissed jstarry’s stale review March 23, 2021 09:05

Pull request has been modified.

@carllin carllin added the v1.6 label Mar 25, 2021
@carllin carllin merged commit 52703ba into solana-labs:master Mar 25, 2021
mergify bot pushed a commit that referenced this pull request Mar 25, 2021
carllin added a commit that referenced this pull request Mar 25, 2021
t-nelson pushed a commit that referenced this pull request Mar 30, 2021
t-nelson pushed a commit that referenced this pull request Mar 30, 2021
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021
4 participants