
Conversation

@paulhauner (Member) commented Dec 11, 2020

Issue Addressed

  • Resolves #1883

Proposed Changes

This follows on from @blacktemplar's work in #2018.

  • Allows the VC to connect to multiple BNs for redundancy.
    • Updates the simulator so some nodes always need to rely on their fallback.
  • Adds some extra deprecation warnings for --eth1-endpoint.
  • Passes SignatureBytes as a reference instead of by value.

Additional Info

NA

blacktemplar and others added 17 commits November 30, 2020 13:30

  • …fallback module + adapt usage for eth1 nodes
    # Conflicts:
    #	beacon_node/eth1/src/service.rs
    #	validator_client/src/attestation_service.rs
    #	validator_client/src/block_service.rs
    #	validator_client/src/duties_service.rs
    #	validator_client/src/fork_service.rs
    # Conflicts:
    #	Cargo.lock
  • … types to distinguish the two levels of fallback error memory + Offline state gets now remembered together with the Sync state + adds invalid endpoints to some validators in the simulator to test fallback logic
  • …coverable error remembers that node is offline + check remembered offline state when trying to reapply to unsynced nodes if allow_unsynced is true
@paulhauner paulhauner added the work-in-progress PR is a work-in-progress label Dec 11, 2020
@paulhauner paulhauner mentioned this pull request Dec 11, 2020
@paulhauner paulhauner added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Dec 11, 2020
@blacktemplar (Contributor) left a comment

In general this works for me; I'm proposing some smaller changes.


/// Indicates if a beacon node must be synced before some action is performed on it.
#[derive(PartialEq, Clone, Copy)]
pub enum RequireSynced {
@blacktemplar (Contributor):

Any particular reason to use a custom enum rather than just a boolean? We already use a boolean to store this information in the duties service (where it gets converted to this enum at some point).

@paulhauner (Member, Author), Dec 14, 2020:

It's a zero-cost abstraction and I find it much clearer to read some_function(RequireSynced::Yes, x) than some_function(true, x).

It also ensures type safety if you add more parameters: some_function(RequireSynced::No, PreferSynced::Yes, thing) vs some_function(false, true, x).

I find bools are fine when referred to by variable name, but I think having miscellaneous true and false values floating around quickly damages readability, which is valuable on a large project with many contributors.
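As a hedged illustration of this point (the function names below are made up for the example, not the actual Lighthouse API; only the RequireSynced enum comes from the diff above):

```rust
/// Indicates if a beacon node must be synced before some action is performed on it.
#[derive(PartialEq, Clone, Copy)]
pub enum RequireSynced {
    Yes,
    No,
}

// Hypothetical call targets, used only to show the call-site difference.
fn publish_block_bool(_require_synced: bool) {}
fn publish_block_enum(_require_synced: RequireSynced) {}

fn main() {
    // With a bare boolean, the reader has to check the signature to know what `true` means.
    publish_block_bool(true);

    // With the enum, the intent is explicit at the call site, at zero runtime cost.
    publish_block_enum(RequireSynced::Yes);
}
```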

// when the status does not require refreshing anymore. This is deemed an
// acceptable inefficiency.
let _ = candidate
.refresh_status(self.slot_clock.as_ref(), &self.spec, &self.log)
@blacktemplar (Contributor), Dec 11, 2020:

Is it worth doing that in parallel? We are already in an async context, so we would only need to call join_all on the refresh_status futures.

@paulhauner (Member, Author):

Good call, I've cherry-picked 02c2289
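For reference, a minimal sketch of the parallel variant suggested here, using futures::future::join_all; the Candidate type and refresh_status signature below are simplified stand-ins, not the real Lighthouse code:

```rust
use futures::future::join_all;

// Simplified stand-in for a beacon node candidate.
struct Candidate {
    url: String,
}

impl Candidate {
    // Stand-in for `refresh_status`; in Lighthouse this queries the node's sync status.
    async fn refresh_status(&self) -> Result<(), ()> {
        println!("refreshing {}", self.url);
        Ok(())
    }
}

#[tokio::main]
async fn main() {
    let candidates = vec![
        Candidate { url: "http://localhost:5052".into() },
        Candidate { url: "http://localhost:5152".into() },
    ];

    // Refresh every candidate concurrently instead of awaiting each one in turn.
    // Results are ignored, matching the `let _ =` in the snippet above.
    let _ = join_all(candidates.iter().map(|c| c.refresh_status())).await;
}
```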

if let Err(e) = candidate.status(require_synced).await {
// This client was not ready on the first pass, we might try it again later.
to_retry.push(candidate);
errors.push((candidate.beacon_node.to_string(), Error::Unavailable(e)));
@blacktemplar (Contributor):

Is it intended that there will be two errors for this candidate in the errors vector? We could change the errors vector to a HashMap so that we only return one error per candidate.

@paulhauner (Member, Author):

I think two errors is fine. We get to see why it failed the first time and then why it failed the second time; this is useful information. E.g., it was offline, then it came online but returned an error.
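For illustration, a tiny sketch of the trade-off (the endpoint and error strings are invented for the example; the real errors vector stores the crate's own error type):

```rust
use std::collections::HashMap;

fn main() {
    // A Vec keeps both the first-pass error and the retry error for the same
    // endpoint, which is the behaviour preferred here.
    let mut errors: Vec<(String, String)> = Vec::new();
    errors.push(("http://localhost:5052".into(), "offline".into()));
    errors.push(("http://localhost:5052".into(), "server error on retry".into()));
    assert_eq!(errors.len(), 2);

    // A HashMap keyed by endpoint would retain only the most recent error.
    let mut deduped: HashMap<String, String> = HashMap::new();
    deduped.insert("http://localhost:5052".into(), "offline".into());
    deduped.insert("http://localhost:5052".into(), "server error on retry".into());
    assert_eq!(deduped.len(), 1);
}
```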

}
};

if let Err(e) = new_status {
@blacktemplar (Contributor):

If require_synced == RequireSynced::No, shouldn't we ignore a sync error here and proceed with the else branch?

@paulhauner (Member, Author):

Nice catch! I've addressed this in f6e1486.
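Roughly, the behaviour being asked for is something like the sketch below (the error variants and helper function are simplified assumptions for illustration; the actual change is in f6e1486):

```rust
// Simplified stand-ins for the real types.
#[derive(Debug)]
enum CandidateError {
    Offline,
    NotSynced,
}

#[derive(PartialEq, Clone, Copy)]
enum RequireSynced {
    Yes,
    No,
}

// Decide whether a refreshed status should disqualify the candidate.
fn is_usable(new_status: Result<(), CandidateError>, require_synced: RequireSynced) -> bool {
    match new_status {
        Ok(()) => true,
        // If the caller tolerates unsynced nodes, being unsynced is acceptable.
        Err(CandidateError::NotSynced) => require_synced == RequireSynced::No,
        // Any other error (e.g. offline) disqualifies the candidate.
        Err(_) => false,
    }
}

fn main() {
    assert!(is_usable(Err(CandidateError::NotSynced), RequireSynced::No));
    assert!(!is_usable(Err(CandidateError::NotSynced), RequireSynced::Yes));
    assert!(!is_usable(Err(CandidateError::Offline), RequireSynced::No));
}
```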


if let Err(e) = new_status {
errors.push((candidate.beacon_node.to_string(), Error::Unavailable(e)));
} else {
@blacktemplar (Contributor):

Can we deduplicate that else block somehow?

@paulhauner (Member, Author):

Yeah, good point; I didn't notice they're exactly the same. I've addressed this in f6e1486.

if let Some(slot_clock) = slot_clock {
match check_synced(&self.beacon_node, slot_clock, Some(log)).await {
Ok(_) => Ok(()),
Err(_) => Err(CandidateError::NotSynced),
@blacktemplar (Contributor):

I think we should add a warn log here.

@paulhauner (Member, Author):

Nice, I've cherry-picked 0e11b02
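The kind of change being cherry-picked looks roughly like this (a sketch using slog, which Lighthouse uses for logging; the message, the field name, and the synchronous stand-in for check_synced are assumptions, not the exact cherry-picked code):

```rust
use slog::{o, warn, Logger};

#[derive(Debug)]
enum CandidateError {
    NotSynced,
}

// Synchronous stand-in for the real `check_synced`; returns Err when the node is behind.
fn check_synced(_endpoint: &str) -> Result<(), ()> {
    Err(())
}

fn status(endpoint: &str, log: &Logger) -> Result<(), CandidateError> {
    match check_synced(endpoint) {
        Ok(_) => Ok(()),
        Err(_) => {
            // The suggested addition: surface the unsynced state in the logs
            // instead of silently returning an error.
            warn!(log, "Beacon node is not synced"; "endpoint" => endpoint);
            Err(CandidateError::NotSynced)
        }
    }
}

fn main() {
    // A discard logger keeps the example dependency-light; in Lighthouse the
    // logger is built elsewhere and passed in.
    let log = Logger::root(slog::Discard, o!());
    let _ = status("http://localhost:5052", &log);
}
```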

@blacktemplar (Contributor) left a comment

small addition


// First pass: try `func` on all ready candidates.
for candidate in &self.candidates {
if let Err(e) = candidate.status(require_synced).await {
@blacktemplar (Contributor):

I think I forgot to mention this semantically big difference in my first review of your proposal: the question is what should happen when we don't require synced nodes. For block proposal/attestation, for example, we would still like to prefer synced nodes. Therefore, my proposed fallback implementation had the following logic: first try all nodes that are synced; then, if they all errored or no such node exists, and if require_synced is false, also try all unsynced nodes. I am not sure how the create-block/attest endpoints behave if the node is not synced, but if there is a chance that they don't error when unsynced, I think we should prefer synced endpoints.

@paulhauner (Member, Author):

Good catch, this is a useful feature!

I think that blacktemplar@b0bebaf also has some side effects, though. Consider this scenario:

  • The VC has two BNs: BN1 and BN2.
  • BN1 fails after a timeout.
  • BN2 is not synced.

Now, we make a request that has RequireSynced::No. The following will happen:

  • Both BNs will fail on the first check.
  • The second check for BN1 will fail with a timeout.
  • The second check for BN2 will succeed.

In this scenario we've added an unnecessary cost of one timeout to this call.

I think I've managed to avoid the extra cost in f6e1486; let me know what you think :)
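To make the control flow being discussed concrete, here is a much-simplified, synchronous sketch of the pass ordering (the names and the Status enum are assumptions; the real implementation is async and re-checks node status between passes):

```rust
// Hypothetical status for this sketch; the real code tracks richer state per candidate.
#[derive(Clone, Copy, PartialEq)]
enum Status {
    Synced,
    Unsynced,
    Offline,
}

#[derive(Clone, Copy, PartialEq)]
enum RequireSynced {
    Yes,
    No,
}

struct Candidate {
    name: &'static str,
    status: Status,
}

// Try `func` on candidates in order of preference: synced nodes first, then
// (if the caller allows it) unsynced nodes, and finally nodes previously
// marked offline, in case they have recovered.
fn first_success<T>(
    candidates: &[Candidate],
    require_synced: RequireSynced,
    func: impl Fn(&Candidate) -> Result<T, String>,
) -> Result<T, Vec<String>> {
    let mut errors = Vec::new();

    // First pass: synced candidates only.
    for c in candidates.iter().filter(|c| c.status == Status::Synced) {
        match func(c) {
            Ok(v) => return Ok(v),
            Err(e) => errors.push(format!("{}: {}", c.name, e)),
        }
    }

    // Second pass: unsynced candidates, only if the caller tolerates them.
    if require_synced == RequireSynced::No {
        for c in candidates.iter().filter(|c| c.status == Status::Unsynced) {
            match func(c) {
                Ok(v) => return Ok(v),
                Err(e) => errors.push(format!("{}: {}", c.name, e)),
            }
        }
    }

    // Third pass: candidates previously marked offline.
    for c in candidates.iter().filter(|c| c.status == Status::Offline) {
        match func(c) {
            Ok(v) => return Ok(v),
            Err(e) => errors.push(format!("{}: {}", c.name, e)),
        }
    }

    Err(errors)
}

fn main() {
    // The scenario above: BN1 previously failed (treated as offline), BN2 is unsynced.
    let candidates = [
        Candidate { name: "BN1", status: Status::Offline },
        Candidate { name: "BN2", status: Status::Unsynced },
    ];
    let func = |c: &Candidate| {
        if c.name == "BN1" {
            Err("timeout".to_string())
        } else {
            Ok(c.name)
        }
    };

    // With RequireSynced::No, BN2 is used in the second pass without first
    // paying for BN1's timeout.
    assert_eq!(first_success(&candidates, RequireSynced::No, func), Ok("BN2"));

    // With RequireSynced::Yes, only the offline re-check remains and the call fails.
    assert!(first_success(&candidates, RequireSynced::Yes, func).is_err());
}
```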

@blacktemplar (Contributor), Dec 15, 2020:

Hey Paul, in your current solution I didn't understand at first why all online candidates are rerun in the second pass, so I wrote this: e469e85

But then I understood that if all candidates failed in the first run, then there are no "online" candidates and all candidates are either unsynced or otherwise unavailable. So I think your solution is good. What we should consider (and maybe mention in a comment) is what happens if the online status changes for some candidates between the first and the second pass (due to asynchronicity). Then we might rerun func for some candidates in the second pass. Is that wanted behavior or not? (If not, we could use the changes I proposed in e469e85.)

Otherwise, I am now very happy with the current solution :).

Unfortunately, we have a fmt error (it might be from my cherry-picked changes, I am so sorry) and I can't push to your branch.

Edit: Just saw that in your current solution we would add offline candidates twice to to_retry and therefore check them twice in the third pass. Maybe my proposed changes in e469e85 are not so bad after all (I think they also make the behavior more explicit).

@paulhauner (Member, Author):

> Just saw that in your current solution we would add offline candidates twice to to_retry and therefore check them twice in the third pass.

Good catch! Thank you :)

> Maybe my proposed changes in e469e85 are not so bad after all (I think they also make the behavior more explicit).

I agree, it is more explicit and I like it; I've cherry-picked that commit. However, I did also add d17d36a, since your changes meant that we don't try to re-check the nodes' sync status during the third loop. Let me know if you see any problems with this.

I can't believe how fiddly this task is! 😅

@blacktemplar (Contributor), Dec 17, 2020:

I thought rechecking the status is not necessary for unsynced nodes, since they get executed in the second pass anyway, regardless of what their (real) status is. So if we get to the third pass, that means the unsynced nodes all errored and their status was set to offline (maybe the calling function errors if the BN is too far behind); in that case I am not sure it is useful to do the same call again if the node is still unsynced.

So basically, if we allow unsynced nodes, then (after trying all synced nodes) unsynced is an accepted status for me. Therefore we don't recheck the status for them (we also don't recheck the status of synced nodes if they error the first time).

What we could think about: is it correct behavior to always assume the status is offline if the function errors? As described above, it could happen that an unsynced node gets set to offline because the function errors when a node is too far behind. This would mean that in the next call (with allow_unsynced == true) the node is considered offline and therefore only checked in the third pass. But maybe that is acceptable in this scenario?

PS: Can you merge in the unstable branch to let CI run again :)?

@paulhauner (Member, Author):

> I thought rechecking the status is not necessary for unsynced nodes, since they get executed in the second pass anyway, regardless of what their (real) status is.

Oh yeah, you're right. I missed this! I'll remove that extra line I added.

> Is it correct behavior to always assume the status is offline if the function errors? As described above, it could happen that an unsynced node gets set to offline because the function errors when a node is too far behind. This would mean that in the next call (with allow_unsynced == true) the node is considered offline and therefore only checked in the third pass. But maybe that is acceptable in this scenario?

Yeah, it isn't ideal that we just set them to offline regardless of the failure type. The alternative is trying to match on the reqwest errors, but I think we might end up opening a can of worms there. It's not ideal in the scenario you described, but it works out in the end, so I think we can deem it acceptable. Let me know if you disagree.

> PS: Can you merge in the unstable branch to let CI run again :)?

Done! Correct me if I'm wrong, but I think we'd be in a good place to get this merged, once CI passes. Yay!

@blacktemplar (Contributor) commented Dec 11, 2020

Since you are on vacation, I created a new branch where I incorporated all my proposed changes to this PR (one commit per proposed change). When you read my comments you can either merge or cherry-pick from there: https://github.com/blacktemplar/lighthouse/tree/ph-bn-fallback-proposed-changes

I addressed all my comments except the duplicate else block one (this is kinda hard with the return in there and probably not worth it).

Edit: I created a PR against your branch just to see if CI runs through: #2081

@blacktemplar blacktemplar added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Dec 11, 2020
@paulhauner (Member, Author) commented:

I've addressed all the comments; thank you for the detailed review, @blacktemplar!

I'll flag this as ready-for-review for now and come back to check CI later :)

@paulhauner paulhauner added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Dec 15, 2020
@blacktemplar blacktemplar added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Dec 16, 2020
@paulhauner paulhauner added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Dec 17, 2020
@blacktemplar (Contributor) left a comment

Yay, I think we are ready for bors :).

@blacktemplar blacktemplar added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Dec 18, 2020
@paulhauner paulhauner marked this pull request as ready for review December 18, 2020 09:16
@paulhauner (Member, Author) commented:

Thank you @blacktemplar, much appreciated! I think users will love this one (I know I will!).

bors r+

bors bot pushed a commit that referenced this pull request Dec 18, 2020
## Issue Addressed

- Resolves #1883

## Proposed Changes

This follows on from @blacktemplar's work in #2018.

- Allows the VC to connect to multiple BNs for redundancy.
  - Updates the simulator so some nodes always need to rely on their fallback.
- Adds some extra deprecation warnings for `--eth1-endpoint`.
- Passes `SignatureBytes` as a reference instead of by value.

## Additional Info

NA

Co-authored-by: blacktemplar <blacktemplar@a1.net>
@bors (bot) commented Dec 18, 2020

@bors bors bot changed the title BN Fallback v2 [Merged by Bors] - BN Fallback v2 Dec 18, 2020
@bors bors bot closed this Dec 18, 2020
@realbigsean realbigsean mentioned this pull request Jan 14, 2021
@paulhauner paulhauner deleted the ph-bn-fallback branch January 20, 2021 00:20