# [Merged by Bors] - BN Fallback v2 #2080
## Conversation
Commits:
- …fallback module + adapt usage for eth1 nodes
- Conflicts resolved: `beacon_node/eth1/src/service.rs`, `validator_client/src/attestation_service.rs`, `validator_client/src/block_service.rs`, `validator_client/src/duties_service.rs`, `validator_client/src/fork_service.rs`
- Conflicts resolved: `Cargo.lock`
- …types to distinguish the two levels of fallback error memory; the Offline state now gets remembered together with the Sync state; adds invalid endpoints to some validators in the simulator to test the fallback logic
- …coverable error remembers that the node is offline; check the remembered offline state when trying to re-apply to unsynced nodes if `allow_unsynced` is true
**blacktemplar** left a comment:
In general this works for me; I'm proposing some smaller changes below.
```rust
/// Indicates if a beacon node must be synced before some action is performed on it.
#[derive(PartialEq, Clone, Copy)]
pub enum RequireSynced {
```
Any particular reason for using a custom enum instead of just a boolean? We already use a boolean to store this information in the duties service (where it gets converted to this enum at some point).
It's a zero-cost abstraction and I find it much clearer to read `some_function(RequireSynced::Yes, x)` than `some_function(true, x)`.
It also ensures type-safety if you add other things: `some_function(RequireSynced::No, PreferSynced::Yes, thing)` vs `some_function(false, true, x)`.
I find bools are fine when referred to by variable name, but I think having miscellaneous `true`s and `false`s floating around quickly damages readability, which is valuable on a large project with many contributors.
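For illustration, here is a minimal sketch of the readability argument above. The `RequireSynced` enum follows the PR's definition; `check_node` and the endpoint URL are hypothetical stand-ins:

```rust
/// Indicates if a beacon node must be synced before some action is performed on it.
#[derive(PartialEq, Clone, Copy)]
pub enum RequireSynced {
    Yes,
    No,
}

/// Hypothetical helper, used only to show what the call site looks like.
fn check_node(require_synced: RequireSynced, endpoint: &str) {
    if require_synced == RequireSynced::Yes {
        // Only proceed once the node reports itself as synced.
    }
    let _ = endpoint;
}

fn main() {
    // The intent is obvious at the call site...
    check_node(RequireSynced::Yes, "http://localhost:5052");
    // ...whereas a bare boolean, e.g. `check_node(true, ...)`, would not be.
}
```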
```rust
// when the status does not require refreshing anymore. This is deemed an
// acceptable inefficiency.
let _ = candidate
    .refresh_status(self.slot_clock.as_ref(), &self.spec, &self.log)
```
Is it worth doing that in parallel? We are already in an async context, so we would only need to call `join_all` on the `refresh_status` futures.
Good call, I've cherry-picked 02c2289.
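For illustration, a minimal sketch of the parallel-refresh idea using `futures::future::join_all` (assuming the `futures` crate is available; `Candidate` here is a simplified stand-in for the PR's type):

```rust
use futures::future::join_all;

/// Simplified stand-in for the PR's candidate type.
struct Candidate {
    endpoint: String,
}

impl Candidate {
    async fn refresh_status(&self) -> Result<(), ()> {
        // Stand-in for querying `self.endpoint` and updating the cached status.
        let _ = &self.endpoint;
        Ok(())
    }
}

/// Refresh every candidate concurrently instead of awaiting them one by one.
async fn refresh_all(candidates: &[Candidate]) {
    let futures = candidates.iter().map(|c| c.refresh_status());
    let _results: Vec<Result<(), ()>> = join_all(futures).await;
}
```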
```rust
if let Err(e) = candidate.status(require_synced).await {
    // This client was not ready on the first pass, we might try it again later.
    to_retry.push(candidate);
    errors.push((candidate.beacon_node.to_string(), Error::Unavailable(e)));
```
Is it intended that there can be two errors for this candidate in the errors vector? We could change the errors vector to a `HashMap` so that only one error is returned per candidate.
I think two errors is fine. We get to see why it failed the first time and then why it failed the second time; this is useful information. E.g., it was offline, then it came online but returned an error.
```rust
    }
};

if let Err(e) = new_status {
```
If `require_synced == RequireSynced::No`, shouldn't we ignore a sync error here and proceed with the else branch?
Nice catch! I've addressed this in f6e1486.
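For illustration, a sketch of the behavior being agreed on here: a sync error only disqualifies a candidate when the caller actually requires a synced node. The names below are illustrative, not the exact code from f6e1486:

```rust
#[derive(PartialEq, Clone, Copy)]
pub enum RequireSynced {
    Yes,
    No,
}

pub enum CandidateError {
    Offline,
    NotSynced,
}

/// Decide whether a status-check result should rule a candidate out,
/// given whether the caller insists on a synced node.
fn is_usable(status: Result<(), CandidateError>, require_synced: RequireSynced) -> bool {
    match status {
        Ok(()) => true,
        // An unsynced node is still usable if the caller tolerates it.
        Err(CandidateError::NotSynced) => require_synced == RequireSynced::No,
        // Any other failure (e.g. offline) rules the candidate out.
        Err(_) => false,
    }
}

fn main() {
    assert!(is_usable(Err(CandidateError::NotSynced), RequireSynced::No));
    assert!(!is_usable(Err(CandidateError::NotSynced), RequireSynced::Yes));
    assert!(!is_usable(Err(CandidateError::Offline), RequireSynced::No));
    assert!(is_usable(Ok(()), RequireSynced::Yes));
}
```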
```rust
if let Err(e) = new_status {
    errors.push((candidate.beacon_node.to_string(), Error::Unavailable(e)));
} else {
```
Can we deduplicate that else block somehow?
Yeah, good point; I didn't notice they're exactly the same. I've addressed this in f6e1486.
```rust
if let Some(slot_clock) = slot_clock {
    match check_synced(&self.beacon_node, slot_clock, Some(log)).await {
        Ok(_) => Ok(()),
        Err(_) => Err(CandidateError::NotSynced),
```
I think we should add a warn log here.
Nice, I've cherry-picked 0e11b02.
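For illustration, a sketch of where such a warning could sit, assuming the slog-style logging used elsewhere in Lighthouse; the wrapper function and log fields are hypothetical:

```rust
use slog::{warn, Logger};

pub enum CandidateError {
    NotSynced,
}

/// Hypothetical wrapper: record an unsynced node with a warn-level log before
/// returning the `NotSynced` error.
fn record_sync_result(synced: bool, endpoint: &str, log: &Logger) -> Result<(), CandidateError> {
    if synced {
        Ok(())
    } else {
        warn!(
            log,
            "Beacon node is not synced";
            "endpoint" => endpoint
        );
        Err(CandidateError::NotSynced)
    }
}
```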
**blacktemplar** left a comment:
A small addition:
```rust
// First pass: try `func` on all ready candidates.
for candidate in &self.candidates {
    if let Err(e) = candidate.status(require_synced).await {
```
I think I forgot to mention this semantically big difference in my first review of your proposal: the question is what should happen if we don't require a synced node. For block proposals/attestations, for example, we would still prefer synced nodes. Therefore, my proposed fallback implementation had the following logic: first try all nodes that are synced; then, if they all errored or no such node exists, and if require-synced is false, also try all unsynced nodes. I am not sure how the create-block / attest endpoints behave when the node is not synced, but if there is a chance that they don't error when unsynced, I think we should prefer synced endpoints.
Good catch, this is a useful feature!
I think that blacktemplar@b0bebaf also has some side effects, though. Consider this scenario:
- The VC has two BNs: BN1 and BN2.
- BN1 fails after a timeout.
- BN2 is not synced.

Now we make a request with `RequireSynced::No`. The following will happen:

- Both BNs will fail the first check.
- The second check for BN1 will fail with a timeout.
- The second check for BN2 will succeed.
In this scenario we've added an unnecessary cost of one timeout to this call.
I think I've managed to avoid the extra cost in f6e1486, let me know what you think :)
Hey Paul, in your current solution I at first didn't understand why all online candidates are re-run in the second pass, and I wrote this: e469e85
But then I understood that if all candidates failed in the first run, there are no "online" candidates and all candidates are either unsynced or otherwise unavailable. So I think your solution is good. What we should consider (and maybe mention in a comment) is what happens if the online status of some candidates changes between the first and the second pass (due to asynchronicity); then we might re-run `func` for some candidates in the second pass. Is that wanted behavior or not? (If not, we could use the changes I proposed in e469e85.)
Otherwise, I am now very happy with the current solution :).
Unfortunately, we have a fmt error (might be from my cherry-picked changes, I am so sorry) and I can't push to your branch.
Edit: Just saw that in your current solution we would add offline candidates twice to `to_retry` and therefore check them twice in the third pass. Maybe my proposed changes in e469e85 are not so bad after all (I think they also make the behavior more explicit).
> Just saw that in your current solution we would add offline candidates twice to `to_retry` and therefore check them twice in the third pass.
Good catch! Thank you :)
> Maybe my proposed changes in e469e85 are not so bad after all (I think they also make the behavior more explicit).
I agree, it is more explicit and I like it. I've cherry-picked that commit. However, I did also add d17d36a, since your changes meant that we don't try to re-check the nodes' sync status during the third loop. Let me know if you see any problems with this.
I can't believe how fiddly this task is! 😅
I thought re-checking the status is not necessary for unsynced nodes, since they get tried in the second pass anyway, regardless of what their (real) status is. So if we get to the third pass, that means the unsynced nodes all errored and their status was set to offline (maybe the calling function errors if the BN is too far behind); I am not sure it is useful to make the same call again if the node is still unsynced.
So basically, if we allow unsynced nodes, then (after trying all synced nodes) unsynced is an accepted status for me. Therefore we don't re-check the status for them (we also don't re-check the status of synced nodes if they error the first time).
What we could think about: is it correct behavior to always assume the status is offline if the function errors? As described above, it could happen that an unsynced node gets set to offline because the function errors when a node is too far behind. On the next call (with `allow_unsynced == true`) the node would then be considered offline and therefore only checked in the third pass. But maybe that is acceptable in this scenario?
PS: Can you merge in the unstable branch to let CI run again :)?
> I thought re-checking the status is not necessary for unsynced nodes, since they get tried in the second pass anyway, regardless of what their (real) status is.
Oh yeah, you're right. I missed this! I'll remove that extra line I added.
> Is it correct behavior to always assume the status is offline if the function errors? As described above, it could happen that an unsynced node gets set to offline because the function errors when a node is too far behind. On the next call (with `allow_unsynced == true`) the node would then be considered offline and therefore only checked in the third pass. But maybe that is acceptable in this scenario?
Yeah, it isn't ideal that we just set them to offline regardless of the failure type. The alternative to this is trying to match on the reqwest errors, but I think we might end up opening a can of worms there. It's not ideal in the scenario you described, but it eventually works out in the end so I think we can deem it acceptable. Let me know if you disagree.
> PS: Can you merge in the unstable branch to let CI run again :)?
Done! Correct me if I'm wrong, but I think we'd be in a good place to get this merged, once CI passes. Yay!
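To summarize the thread above, here is a deliberately simplified model of the pass ordering that was settled on: synced candidates first, then unsynced candidates if the caller allows them, then a final retry of candidates that previously looked offline. The types and the `try_candidate` stand-in are illustrative only; the PR's real `fallback` module also refreshes and mutates candidate statuses along the way:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Synced,
    Unsynced,
    Offline,
}

struct Candidate {
    endpoint: &'static str,
    status: Status,
}

/// Stand-in for issuing the real request to a beacon node.
fn try_candidate(c: &Candidate) -> Result<(), ()> {
    if c.status == Status::Offline {
        Err(())
    } else {
        Ok(())
    }
}

/// Return the first endpoint that services the request, trying candidates in
/// the order discussed above.
fn first_success(candidates: &[Candidate], allow_unsynced: bool) -> Option<&'static str> {
    // First pass: only candidates that currently look synced.
    for c in candidates.iter().filter(|c| c.status == Status::Synced) {
        if try_candidate(c).is_ok() {
            return Some(c.endpoint);
        }
    }
    // Second pass: unsynced candidates, but only if the caller tolerates them.
    if allow_unsynced {
        for c in candidates.iter().filter(|c| c.status == Status::Unsynced) {
            if try_candidate(c).is_ok() {
                return Some(c.endpoint);
            }
        }
    }
    // Third pass: retry candidates that previously appeared offline.
    for c in candidates.iter().filter(|c| c.status == Status::Offline) {
        if try_candidate(c).is_ok() {
            return Some(c.endpoint);
        }
    }
    None
}

fn main() {
    let candidates = [
        Candidate { endpoint: "http://bn1:5052", status: Status::Offline },
        Candidate { endpoint: "http://bn2:5052", status: Status::Unsynced },
    ];
    // With unsynced nodes allowed, BN2 is picked up in the second pass.
    assert_eq!(first_success(&candidates, true), Some("http://bn2:5052"));
}
```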
Since you are on vacation, I created a new branch where I incorporated all my proposed changes to this PR (one commit per proposed change). When you read my comments you can either merge or cherry-pick from there: https://github.com/blacktemplar/lighthouse/tree/ph-bn-fallback-proposed-changes
I addressed all my comments except the duplicate-else-block one (this is kinda hard with the return in there and probably not worth it).
Edit: created a PR to your branch just to see if CI runs through: #2081
I've addressed all the comments, thank you for the detailed review @blacktemplar! I'll flag this as ready-to-review for now and come back to check CI later :)
Yay, I think we are ready for bors :).
Thank you @blacktemplar, much appreciated! I think users will love this one (I know I will!).
bors r+
## Issue Addressed

- Resolves #1883

## Proposed Changes

This follows on from @blacktemplar's work in #2018.

- Allows the VC to connect to multiple BNs for redundancy.
- Updates the simulator so some nodes always need to rely on their fallback.
- Adds some extra deprecation warnings for `--eth1-endpoint`.
- Passes `SignatureBytes` as a reference instead of by value.

## Additional Info

NA

Co-authored-by: blacktemplar <blacktemplar@a1.net>
Pull request successfully merged into unstable. Build succeeded.