# [Merged by Bors] - BN Fallback v2 #2080
## Conversation
Commits:
- …fallback module + adapt usage for eth1 nodes
- Conflicts resolved: `beacon_node/eth1/src/service.rs`, `validator_client/src/attestation_service.rs`, `validator_client/src/block_service.rs`, `validator_client/src/duties_service.rs`, `validator_client/src/fork_service.rs`
- Conflicts resolved: `Cargo.lock`
- …types to distinguish the two levels of fallback error memory; the Offline state now gets remembered together with the Sync state; adds invalid endpoints to some validators in the simulator to test the fallback logic
- …coverable error remembers that the node is offline; check the remembered offline state when trying to re-apply to unsynced nodes if `allow_unsynced` is true
**blacktemplar** left a comment:
In general this works for me; I'm proposing some smaller changes below.
```rust
/// Indicates if a beacon node must be synced before some action is performed on it.
#[derive(PartialEq, Clone, Copy)]
pub enum RequireSynced {
```
Any particular reason for using a custom enum instead of just a boolean? We already use a boolean to store this information in the duties service (where it gets converted to this enum at some point).
It's a zero-cost abstraction and I find it much clearer to read `some_function(RequireSynced::Yes, x)` than `some_function(true, x)`.
It also ensures type-safety if you add other things: `some_function(RequireSynced::No, PreferSynced::Yes, thing)` vs `some_function(false, true, x)`.
I find bools are fine when referred to by variable name, but I think having miscellaneous `true`s and `false`s floating around quickly damages readability, which is valuable on a large project with many contributors.
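For illustration, here is a minimal sketch of the readability argument above. The `RequireSynced` enum follows the PR's definition; `check_node` and the endpoint URL are hypothetical stand-ins:

```rust
/// Indicates if a beacon node must be synced before some action is performed on it.
#[derive(PartialEq, Clone, Copy)]
pub enum RequireSynced {
    Yes,
    No,
}

/// Hypothetical helper, used only to show what the call site looks like.
fn check_node(require_synced: RequireSynced, endpoint: &str) {
    if require_synced == RequireSynced::Yes {
        // Only proceed once the node reports itself as synced.
    }
    let _ = endpoint;
}

fn main() {
    // The intent is obvious at the call site...
    check_node(RequireSynced::Yes, "http://localhost:5052");
    // ...whereas a bare boolean, e.g. `check_node(true, ...)`, would not be.
}
```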
```rust
// when the status does not require refreshing anymore. This is deemed an
// acceptable inefficiency.
let _ = candidate
    .refresh_status(self.slot_clock.as_ref(), &self.spec, &self.log)
```
Is it worth doing that in parallel? We are already in an async context, so we would only need to call `join_all` on the `refresh_status` futures.
Good call, I've cherry-picked 02c2289.
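For illustration, a minimal sketch of the parallel-refresh idea using `futures::future::join_all` (assuming the `futures` crate is available; `Candidate` here is a simplified stand-in for the PR's type):

```rust
use futures::future::join_all;

/// Simplified stand-in for the PR's candidate type.
struct Candidate {
    endpoint: String,
}

impl Candidate {
    async fn refresh_status(&self) -> Result<(), ()> {
        // Stand-in for querying `self.endpoint` and updating the cached status.
        let _ = &self.endpoint;
        Ok(())
    }
}

/// Refresh every candidate concurrently instead of awaiting them one by one.
async fn refresh_all(candidates: &[Candidate]) {
    let futures = candidates.iter().map(|c| c.refresh_status());
    let _results: Vec<Result<(), ()>> = join_all(futures).await;
}
```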
```rust
if let Err(e) = candidate.status(require_synced).await {
    // This client was not ready on the first pass, we might try it again later.
    to_retry.push(candidate);
    errors.push((candidate.beacon_node.to_string(), Error::Unavailable(e)));
```
Is it intended that there can be two errors for this candidate in the errors vector? We could change the errors vector to a `HashMap` so that only one error is returned per candidate.
I think two errors is fine. We get to see why it failed the first time and then why it failed the second time; this is useful information. E.g., it was offline, then it came online but returned an error.
```rust
    }
};

if let Err(e) = new_status {
```
If `require_synced == RequireSynced::No`, shouldn't we ignore a sync error here and proceed with the else branch?
Nice catch! I've addressed this in f6e1486.
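For illustration, a sketch of the behavior being agreed on here: a sync error only disqualifies a candidate when the caller actually requires a synced node. The names below are illustrative, not the exact code from f6e1486:

```rust
#[derive(PartialEq, Clone, Copy)]
pub enum RequireSynced {
    Yes,
    No,
}

pub enum CandidateError {
    Offline,
    NotSynced,
}

/// Decide whether a status-check result should rule a candidate out,
/// given whether the caller insists on a synced node.
fn is_usable(status: Result<(), CandidateError>, require_synced: RequireSynced) -> bool {
    match status {
        Ok(()) => true,
        // An unsynced node is still usable if the caller tolerates it.
        Err(CandidateError::NotSynced) => require_synced == RequireSynced::No,
        // Any other failure (e.g. offline) rules the candidate out.
        Err(_) => false,
    }
}

fn main() {
    assert!(is_usable(Err(CandidateError::NotSynced), RequireSynced::No));
    assert!(!is_usable(Err(CandidateError::NotSynced), RequireSynced::Yes));
    assert!(!is_usable(Err(CandidateError::Offline), RequireSynced::No));
    assert!(is_usable(Ok(()), RequireSynced::Yes));
}
```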
```rust
if let Err(e) = new_status {
    errors.push((candidate.beacon_node.to_string(), Error::Unavailable(e)));
} else {
```
Can we deduplicate that else block somehow?
Yeah, good point; I didn't notice they're exactly the same. I've addressed this in f6e1486.
```rust
if let Some(slot_clock) = slot_clock {
    match check_synced(&self.beacon_node, slot_clock, Some(log)).await {
        Ok(_) => Ok(()),
        Err(_) => Err(CandidateError::NotSynced),
```
I think we should add a warn log here.
Nice, I've cherry-picked 0e11b02.
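For illustration, a sketch of where such a warning could sit, assuming the slog-style logging used elsewhere in Lighthouse; the wrapper function and log fields are hypothetical:

```rust
use slog::{warn, Logger};

pub enum CandidateError {
    NotSynced,
}

/// Hypothetical wrapper: record an unsynced node with a warn-level log before
/// returning the `NotSynced` error.
fn record_sync_result(synced: bool, endpoint: &str, log: &Logger) -> Result<(), CandidateError> {
    if synced {
        Ok(())
    } else {
        warn!(
            log,
            "Beacon node is not synced";
            "endpoint" => endpoint
        );
        Err(CandidateError::NotSynced)
    }
}
```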
**blacktemplar** left a comment:
A small addition:
```rust
// First pass: try `func` on all ready candidates.
for candidate in &self.candidates {
    if let Err(e) = candidate.status(require_synced).await {
```
I think I forgot to mention this semantically big difference in my first review of your proposal: the question is what should happen if we don't require a synced node. For block proposals/attestations, for example, we would still prefer synced nodes. Therefore, my proposed fallback implementation had the following logic: first try all nodes that are synced; then, if they all errored or no such node exists, and if require-synced is false, also try all unsynced nodes. I am not sure how the create-block / attest endpoints behave when the node is not synced, but if there is a chance that they don't error when unsynced, I think we should prefer synced endpoints.
Good catch, this is a useful feature!
I think that blacktemplar@b0bebaf also has some side effects, though. Consider this scenario:
- The VC has two BNs: BN1 and BN2.
- BN1 fails after a timeout.
- BN2 is not synced.

Now we make a request with `RequireSynced::No`. The following will happen:

- Both BNs will fail the first check.
- The second check for BN1 will fail with a timeout.
- The second check for BN2 will succeed.
In this scenario we've added an unnecessary cost of one timeout to this call.
I think I've managed to avoid the extra cost in f6e1486, let me know what you think :)
Hey Paul, in your current solution I at first didn't understand why all online candidates are re-run in the second pass, and I wrote this: e469e85
But then I understood that if all candidates failed in the first run, there are no "online" candidates and all candidates are either unsynced or otherwise unavailable. So I think your solution is good. What we should consider (and maybe mention in a comment) is what happens if the online status of some candidates changes between the first and the second pass (due to asynchronicity); then we might re-run `func` for some candidates in the second pass. Is that wanted behavior or not? (If not, we could use the changes I proposed in e469e85.)
Otherwise, I am now very happy with the current solution :).
Unfortunately, we have a fmt error (might be from my cherry-picked changes, I am so sorry) and I can't push to your branch.
Edit: Just saw that in your current solution we would add offline candidates twice to `to_retry` and therefore check them twice in the third pass. Maybe my proposed changes in e469e85 are not so bad after all (I think they also make the behavior more explicit).
> Just saw that in your current solution we would add offline candidates twice to `to_retry` and therefore check them twice in the third pass.
Good catch! Thank you :)
> Maybe my proposed changes in e469e85 are not so bad after all (I think they also make the behavior more explicit).
I agree, it is more explicit and I like it. I've cherry-picked that commit. However, I did also add d17d36a, since your changes meant that we don't try to re-check the nodes' sync status during the third loop. Let me know if you see any problems with this.
I can't believe how fiddly this task is! 😅
I thought re-checking the status is not necessary for unsynced nodes, since they get tried in the second pass anyway, regardless of what their (real) status is. So if we get to the third pass, that means the unsynced nodes all errored and their status was set to offline (maybe the calling function errors if the BN is too far behind); I am not sure it is useful to make the same call again if the node is still unsynced.
So basically, if we allow unsynced nodes, then (after trying all synced nodes) unsynced is an accepted status for me. Therefore we don't re-check the status for them (we also don't re-check the status of synced nodes if they error the first time).
What we could think about: is it correct behavior to always assume the status is offline if the function errors? As described above, it could happen that an unsynced node gets set to offline because the function errors when a node is too far behind. On the next call (with `allow_unsynced == true`) the node would then be considered offline and therefore only checked in the third pass. But maybe that is acceptable in this scenario?
PS: Can you merge in the unstable branch to let CI run again :)?
> I thought re-checking the status is not necessary for unsynced nodes, since they get tried in the second pass anyway, regardless of what their (real) status is.
Oh yeah, you're right. I missed this! I'll remove that extra line I added.
> Is it correct behavior to always assume the status is offline if the function errors? As described above, it could happen that an unsynced node gets set to offline because the function errors when a node is too far behind. On the next call (with `allow_unsynced == true`) the node would then be considered offline and therefore only checked in the third pass. But maybe that is acceptable in this scenario?
Yeah, it isn't ideal that we just set them to offline regardless of the failure type. The alternative to this is trying to match on the reqwest errors, but I think we might end up opening a can of worms there. It's not ideal in the scenario you described, but it eventually works out in the end so I think we can deem it acceptable. Let me know if you disagree.
> PS: Can you merge in the unstable branch to let CI run again :)?
Done! Correct me if I'm wrong, but I think we'd be in a good place to get this merged, once CI passes. Yay!
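To summarize the thread above, here is a deliberately simplified model of the pass ordering that was settled on: synced candidates first, then unsynced candidates if the caller allows them, then a final retry of candidates that previously looked offline. The types and the `try_candidate` stand-in are illustrative only; the PR's real `fallback` module also refreshes and mutates candidate statuses along the way:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Synced,
    Unsynced,
    Offline,
}

struct Candidate {
    endpoint: &'static str,
    status: Status,
}

/// Stand-in for issuing the real request to a beacon node.
fn try_candidate(c: &Candidate) -> Result<(), ()> {
    if c.status == Status::Offline {
        Err(())
    } else {
        Ok(())
    }
}

/// Return the first endpoint that services the request, trying candidates in
/// the order discussed above.
fn first_success(candidates: &[Candidate], allow_unsynced: bool) -> Option<&'static str> {
    // First pass: only candidates that currently look synced.
    for c in candidates.iter().filter(|c| c.status == Status::Synced) {
        if try_candidate(c).is_ok() {
            return Some(c.endpoint);
        }
    }
    // Second pass: unsynced candidates, but only if the caller tolerates them.
    if allow_unsynced {
        for c in candidates.iter().filter(|c| c.status == Status::Unsynced) {
            if try_candidate(c).is_ok() {
                return Some(c.endpoint);
            }
        }
    }
    // Third pass: retry candidates that previously appeared offline.
    for c in candidates.iter().filter(|c| c.status == Status::Offline) {
        if try_candidate(c).is_ok() {
            return Some(c.endpoint);
        }
    }
    None
}

fn main() {
    let candidates = [
        Candidate { endpoint: "http://bn1:5052", status: Status::Offline },
        Candidate { endpoint: "http://bn2:5052", status: Status::Unsynced },
    ];
    // With unsynced nodes allowed, BN2 is picked up in the second pass.
    assert_eq!(first_success(&candidates, true), Some("http://bn2:5052"));
}
```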
Since you are on vacation, I created a new branch where I incorporated all my proposed changes to this PR (one commit per proposed change). When you read my comments you can either merge or cherry-pick from there: https://github.com/blacktemplar/lighthouse/tree/ph-bn-fallback-proposed-changes
I addressed all my comments except the duplicate-else-block one (this is kinda hard with the return in there and probably not worth it).
Edit: created a PR to your branch just to see if CI runs through: #2081
I've addressed all the comments, thank you for the detailed review @blacktemplar! I'll flag this as ready-to-review for now and come back to check CI later :)
Yay, I think we are ready for bors :).
Thank you @blacktemplar, much appreciated! I think users will love this one (I know I will!).
bors r+
## Issue Addressed

- Resolves #1883

## Proposed Changes

This follows on from @blacktemplar's work in #2018.

- Allows the VC to connect to multiple BNs for redundancy.
- Updates the simulator so some nodes always need to rely on their fallback.
- Adds some extra deprecation warnings for `--eth1-endpoint`.
- Passes `SignatureBytes` as a reference instead of by value.

## Additional Info

NA

Co-authored-by: blacktemplar <blacktemplar@a1.net>
Pull request successfully merged into unstable. Build succeeded.