retransmits shreds recovered from erasure codes #19233
Conversation
(force-pushed from 12b60d2 to 0de4fc8)
Codecov Report
```
@@            Coverage Diff             @@
##           master   #19233      +/-   ##
==========================================
- Coverage    82.8%    82.8%     -0.1%
==========================================
  Files         455      455
  Lines      130044   129960       -84
==========================================
- Hits       107780   107692       -88
- Misses      22264    22268        +4
```
(force-pushed from 0de4fc8 to 4e06f9c)
Working towards sending shreds (instead of packets) to the retransmit stage so that shreds recovered from erasure codes are retransmitted as well. A following commit will add these metrics back to the window service, earlier in the pipeline.
Adding back the metrics that the earlier commit removed from the retransmit stage.
Working towards channelling shreds recovered from erasure codes through to the retransmit stage.
insert_data_shred now returns CompletedDataSetInfo directly, instead of opaque (u32, u32) pairs that were converted to CompletedDataSetInfo at the call site.
Shreds recovered from erasure codes have not been received from turbine and have not been retransmitted to other nodes downstream. This results in more repairs across the cluster, which is slower. This commit channels recovered shreds through to the retransmit stage in order to further broadcast them to downstream nodes in the tree.
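In rough terms, the intended flow looks like the sketch below. This is a hedged illustration only: `forward_to_retransmit`, `RepairMeta`, and the stand-in `Shred` type are hypothetical, not the actual solana-labs API.

```rust
use std::sync::mpsc::Sender;

// Stand-ins for the real ledger types; illustrative only.
struct Shred;
struct RepairMeta;

// Shreds received from turbine (minus repair responses) plus shreds
// recovered locally from erasure codes are all forwarded to the
// retransmit stage, so recovered shreds also reach downstream nodes.
fn forward_to_retransmit(
    received: Vec<(Shred, Option<RepairMeta>)>,
    recovered: Vec<Shred>,
    retransmit_sender: &Sender<Vec<Shred>>,
) {
    let mut to_retransmit: Vec<Shred> = received
        .into_iter()
        // Repaired shreds were explicitly requested by this node and
        // should not be re-broadcast downstream.
        .filter(|(_, repair_meta)| repair_meta.is_none())
        .map(|(shred, _)| shred)
        .collect();
    to_retransmit.extend(recovered);
    // A send error only means the receiver was dropped at shutdown.
    let _ = retransmit_sender.send(to_retransmit);
}
```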
(force-pushed from 4e06f9c to 8f5553c)
```rust
.zip(&repair_infos)
.filter(|(_, repair_info)| repair_info.is_none())
.map(|(shred, _)| shred)
.cloned()
```
Wonder if we could avoid costly clones if we switched over to some `Arc<Shred>`, but that's a story for another time.
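For illustration, a minimal sketch of that `Arc<Shred>` idea, assuming hypothetical stand-in types (`Shred`, `RepairMeta`) rather than the real crate API:

```rust
use std::sync::{mpsc::Sender, Arc};

struct Shred;      // stand-in for the real shred type
struct RepairMeta; // stand-in for per-shred repair info

// With Arc<Shred>, the blockstore-insert path and the retransmit path
// could share one allocation per shred; Arc::clone only bumps a
// reference count instead of deep-copying the payload.
fn forward_non_repaired(
    shreds: &[Arc<Shred>],
    repair_infos: &[Option<RepairMeta>],
    retransmit_sender: &Sender<Vec<Arc<Shred>>>,
) {
    let to_retransmit: Vec<Arc<Shred>> = shreds
        .iter()
        .zip(repair_infos)
        .filter(|(_, repair_info)| repair_info.is_none())
        .map(|(shred, _)| Arc::clone(shred)) // cheap refcount bump
        .collect();
    let _ = retransmit_sender.send(to_retransmit);
}
```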
```rust
    rpc_subscriptions: Option<Arc<RpcSubscriptions>>,
    duplicate_slots_sender: Sender<Slot>,
    ancestor_hashes_replay_update_receiver: AncestorHashesReplayUpdateReceiver,
) -> Self {
    let (retransmit_sender, retransmit_receiver) = channel();
    // https://github.com/rust-lang/rust/issues/39364#issuecomment-634545136
    let _retransmit_sender = retransmit_sender.clone();
```
ick, I've run into this too, was annoying to debug
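For context: the linked rust-lang issue covers a panic that `Receiver::recv_timeout` can hit once every `Sender` has been dropped, and the workaround suggested in that comment thread (and used in this diff) is to keep one clone of the sender alive alongside the receiver so the channel never disconnects while it is being polled. A minimal sketch of the pattern; `Stage`, `new_stage`, and `poll` are illustrative names:

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

// Holding one never-used clone of the sender for the lifetime of the
// receiver keeps the channel from reaching the disconnected state,
// sidestepping the recv_timeout panic tracked in rust-lang/rust#39364.
// Trade-off: recv_timeout now reports Timeout rather than Disconnected,
// so shutdown must be signaled another way (e.g. an exit flag).
struct Stage {
    receiver: Receiver<u64>,
    _sender_keepalive: Sender<u64>, // intentionally unused
}

fn new_stage() -> (Sender<u64>, Stage) {
    let (sender, receiver) = channel();
    let stage = Stage {
        receiver,
        _sender_keepalive: sender.clone(),
    };
    (sender, stage)
}

fn poll(stage: &Stage) {
    match stage.receiver.recv_timeout(Duration::from_millis(200)) {
        Ok(v) => println!("received {}", v),
        Err(RecvTimeoutError::Timeout) => {} // normal idle path
        Err(RecvTimeoutError::Disconnected) => {} // unreachable with the keepalive
    }
}

fn main() {
    let (sender, stage) = new_stage();
    sender.send(42).unwrap();
    drop(sender); // all external senders gone, yet no disconnect
    poll(&stage); // prints "received 42"
    poll(&stage); // times out instead of observing Disconnected
}
```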
Hmm, what if we just happen to receive some combination of erasure shreds + data shreds first, such that we recover data shreds we would eventually have gotten from turbine anyway? Will this cause a massive spike in bandwidth if we then retransmit these recovered shreds, even though turbine had no problem circulating them? To avoid this, should we wait a bit to see whether these recovered shreds ultimately arrive through turbine?
Nice, this is so much cleaner, awesome! 😃
looks good!
```diff
     bank_forks: Arc<RwLock<BankForks>>,
-    retransmit: PacketSender,
+    retransmit_sender: Sender<Vec<Shred>>,
```
thanks for renaming these :)
That should be picked up by the duplicates check in the retransmit stage (i.e. …)
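A hypothetical sketch of such a duplicates check — `RetransmitDedup` and its key shape are illustrative assumptions, not the actual retransmit-stage code. Recovered shreds and turbine shreds would consult the same filter, so whichever copy arrives second is dropped:

```rust
use std::collections::HashSet;

type Slot = u64;

// Remember which (slot, index, is_data) keys have already been
// retransmitted and drop repeats, so a shred that shows up both
// recovered-from-erasure and later via turbine goes out only once.
#[derive(Default)]
struct RetransmitDedup {
    seen: HashSet<(Slot, u32, bool)>,
}

impl RetransmitDedup {
    // Returns true only the first time a given shred key is observed.
    fn should_retransmit(&mut self, slot: Slot, index: u32, is_data: bool) -> bool {
        self.seen.insert((slot, index, is_data))
    }
}
```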
```
…0249)

* removes packet-count metrics from retransmit stage

  Working towards sending shreds (instead of packets) to retransmit stage
  so that shreds recovered from erasure codes are as well retransmitted.
  Following commit will add these metrics back to window-service, earlier
  in the pipeline.

  (cherry picked from commit bf437b0)

  # Conflicts:
  #   core/src/retransmit_stage.rs

* adds packet/shred count stats to window-service

  Adding back these metrics from the earlier commit which removed them
  from retransmit stage.

  (cherry picked from commit 8198a7e)

* removes erroneous uses of Arc<...> from retransmit stage

  (cherry picked from commit 6e41333)

  # Conflicts:
  #   core/src/retransmit_stage.rs
  #   core/src/tvu.rs

* sends shreds (instead of packets) to retransmit stage

  Working towards channelling through shreds recovered from erasure codes
  to retransmit stage.

  (cherry picked from commit 3efccbf)

  # Conflicts:
  #   core/src/retransmit_stage.rs

* returns completed-data-set-info from insert_data_shred

  instead of opaque (u32, u32) which are then converted to
  CompletedDataSetInfo at the call-site.

  (cherry picked from commit 3c71670)

  # Conflicts:
  #   ledger/src/blockstore.rs

* retransmits shreds recovered from erasure codes

  Shreds recovered from erasure codes have not been received from turbine
  and have not been retransmitted to other nodes downstream. This results
  in more repairs across the cluster which is slower. This commit channels
  through recovered shreds to retransmit stage in order to further
  broadcast the shreds to downstream nodes in the tree.

  (cherry picked from commit 7a8807b)

  # Conflicts:
  #   core/src/retransmit_stage.rs
  #   core/src/window_service.rs

* removes backport merge conflicts

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
```
test_skip_repair in retransmit-stage is no longer relevant because, following solana-labs#19233, repair packets are filtered out earlier in window-service, so the retransmit stage does not know whether a shred was repaired. Also, following the turbine peer shuffle changes in solana-labs#24080, the test has become flaky since it does not take into account how peers are shuffled for each shred.
```
…ckport #24121) (#24663)

* removes outdated and flaky test_skip_repair from retransmit-stage (#24121)

  test_skip_repair in retransmit-stage is no longer relevant because
  following: #19233 repair packets are filtered out earlier in
  window-service and so retransmit stage does not know if a shred is
  repaired or not. Also, following turbine peer shuffle changes: #24080
  the test has become flaky since it does not take into account how peers
  are shuffled for each shred.

  (cherry picked from commit 2282571)

  # Conflicts:
  #   core/src/retransmit_stage.rs

* removes mergify merge conflicts

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
```
Problem
Shreds recovered from erasure codes have not been received from turbine and have not been retransmitted to other nodes downstream. This results in more repairs across the cluster, which is slower.
Summary of Changes
Channel shreds recovered from erasure codes through to the retransmit stage so they are further broadcast to downstream nodes in the tree.