[Merged by Bors] - Fallback nodes for eth1 access #1918

blacktemplar · 2020-11-16T15:26:51Z

Issue Addressed

part of #1883

Proposed Changes

Adds a new cli argument --eth1-endpoints that can be used instead of --eth1-endpoint to specify a comma-separated list of endpoints. If the first endpoint returns an error for some request the other endpoints are tried in the given order.

Additional Info

Currently if the first endpoint fails the fallbacks are used silently (except for try_fallback_test_endpoint that is used in do_update which logs a WARN for each endpoint that is not reachable). A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.

pawanjay176

LGTM. The macro is pretty neat!
Just a few minor nits.

A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.

Agree we should add extra logs.
Another nice to have imo would be to add metrics for num_correct_responses/num_requests for each passed endpoint. That way the user can set alerts for failing endpoints.

pawanjay176 · 2020-11-17T15:41:14Z

beacon_node/src/cli.rs

                .help("Specifies the server for a web3 connection to the Eth1 chain. Also enables the --eth1 flag. Defaults to http://127.0.0.1:8545.")
                .takes_value(true)
        )
+        .arg(


We could remove eth1-endpoint and just keep eth1-endpoints since eth1-endpoints since endpoints is applicable for both single and multiple endpoints.

I didn't remove eth1-endpoint to stay backwards-compatible, but we can remove that in the next major release...

beacon_node/eth1/src/service.rs

blacktemplar · 2020-11-17T17:06:05Z

LGTM. The macro is pretty neat!
Just a few minor nits.

A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.

Agree we should add extra logs.
Another nice to have imo would be to add metrics for num_correct_responses/num_requests for each passed endpoint. That way the user can set alerts for failing endpoints.

Currently the eth1::http module is quite pure and does not consider logging or metrics, this is the reason why I didn't add it yet. I am not sure if more logging is useful anyway since do_update is called regularly and if an endpoint is down for longer time this will result in warning logs. The metrics would be more interesting I think and shouldn't be a problem to add since its realized via globals anyway (so we don't need to pass anything into the macro or something).

# Conflicts: # beacon_node/eth1/src/service.rs

blacktemplar · 2020-11-18T12:26:49Z

I just realized a problem with the current approach: In the update we check the network id and chain id currently with try fallback. If the first endpoint uses the wrong network id or chain id and the second the correct one then the check will succeed with a WARN log line. But for all the following requests the first endpoint gets called first which may give wrong results because of the wrong network id or chain id.

To fix this we have three possibilities:

Use the fallback_on_err macro over the whole do_update function. This has the drawback that if we have to download a lot of eth1 blocks in the update and lets say one of the last requests fails for some reason, then we redo the whole update and redownload all the blocks with the next fallback endpoint.
Do some sort of lazy testing of the endpoints. This is more work to realize but the idea is that during the whole do_update function we remember for each endpoint if it got tested already or not. Whenever we access an endpoint within this function call tree we check if it got already tested and if not we test it (i.e. check if it is reachable and if network id and chain id are correct).
At the beginning of do_update check all endpoints and remove those that are not reachable or have the wrong network id or chain id. This is easier than 2. but may result in unnecessary requests to the fallbacks when they are not used during the update.

I am currently preparing 2. since I think its the best solution but if you disagree please tell me ;).

…not synced

# Conflicts: # beacon_node/src/cli.rs

blacktemplar · 2020-11-20T11:13:14Z

This is now ready for review. A few notes:

I had to refactor a few parts of the eth1 service so that requests are grouped together before using fallbacks. This is important since if we have a request A and a following request B and we only know if A was successful after processing B we should group them together and process both with one endpoint and if one of both is not successful fallback. On the other hand we only want to do the minimal necessary grouping since otherwise we may redo a lot of unnecessary requests when falling back.
The endpoint's network_id and chain_id is now always checked once for each used endpoint, when calling any of the methods that expects an endpoint list. This implies semantic changes for eth1_genesis_service but I think it is also good to check the network_id and chain_id there.
I introduced now a simple mechanism to detect that a node fall behind and we should switch to a fallback node. At the beginning of the update method we download the latest block from the endpoint (this didn't change in this PR). We can then use this block to check if its timestamp is far in the past. To specify "far" I added a config value node_far_behind_seconds which means if a timestamp of the latest block of an endpoint is in the past more than node_far_behind_seconds seconds we consider that endpoint to be out of sync and fallback. This config value is currently initialized with max(5, spec.eth1_follow_distance / 2) * spec.seconds_per_eth1_block (note that the max is only for degenerated specs).

paulhauner · 2020-11-21T01:28:45Z

beacon_node/eth1/src/service.rs

+/// endpoint. If no endpoint is usable it returns None. Usability of endpoints is checked lazily
+/// using the EndpointsCache structure. For each endpoint if it returns an error the on_err function
+/// is called.
+macro_rules! with_fallback_and_on_err {


The macro is cool, but I'm not convinced that we need to use a macro here. Generally, I find not writing a macro is the more maintainable and extensible solution; compile errors are simpler, imports/exports are easier and the types/generics are more clearly expressed. Of course, there are times where macros are needed or are simpler.

This is how I would go about this problem without templating/macros: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=564a7fc931b10cfc757f752afe261604

Considering that we're going to need this same fallback logic in the VC, I would suggest to use a struct-based approach (like in the the play above) defined in a new crate in common/ that can be shared between validator_client/ and beacon_node/eth1.

You can even include the validator client fallback in this PR, if you like like :)

thanks for the detailed review. I wanted to go a similar route at the beginning but then probably abandoned the idea too early after having problems integrating everything in an async way. I will have a try at that again...

# Conflicts: # beacon_node/eth1/src/service.rs # lcli/src/eth1_genesis.rs

paulhauner

Looks good, apart from the unfortunate merge conflict :)

paulhauner · 2020-11-27T00:18:21Z

common/fallback/src/lib.rs

+        Err(FallbackError::AllErrored(errors))
+    }
+
+    pub fn map_format_error<'a, E, F, S>(&'a self, f: F, error: &FallbackError<E>) -> String


It took me a little while to understand what this doing, but it looks like it does it faithfully.

An alternate approach could be based upon this rough sketch:

impl<T: Display> Fallback<T> { pub fn format_error(&self, error: &FallbackError<E>) -> String { .. } } impl Display for EndpointWithState { .. }

I won't block on this though, it works well as is :)

I didn't do that yet, since I will change that part anyway for restructuring for the bn case.

paulhauner · 2020-11-27T01:06:03Z

beacon_node/eth1/src/service.rs

+        .await
+        .map_err(error_connecting)?;
+    if &network_id != config_network_id {
+        warn!(


This is in conflict with #1981 unfortunately. I think what we need to do is return some error if we get chain_id ==0 so that we try again.

I am returning now an EndpintError::FarBehind error + logging a warning.

# Conflicts: # beacon_node/eth1/src/service.rs

AgeManning · 2020-11-27T08:37:23Z

bors r+

## Issue Addressed part of #1883 ## Proposed Changes Adds a new cli argument `--eth1-endpoints` that can be used instead of `--eth1-endpoint` to specify a comma-separated list of endpoints. If the first endpoint returns an error for some request the other endpoints are tried in the given order. ## Additional Info Currently if the first endpoint fails the fallbacks are used silently (except for `try_fallback_test_endpoint` that is used in `do_update` which logs a `WARN` for each endpoint that is not reachable). A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.

paulhauner

Great, thanks @blacktemplar. I think users will really appreciate this one.

I'll publish a release in the coming hours :)

bors · 2020-11-27T09:42:32Z

Pull request successfully merged into unstable.

Build succeeded:

implements fallback nodes for eth1 access

1966042

AgeManning requested review from paulhauner and pawanjay176 November 17, 2020 02:26

pawanjay176 approved these changes Nov 17, 2020

View reviewed changes

correct comment for test_endpoint

38c68da

blacktemplar added 2 commits November 18, 2020 07:01

Merge branch 'master' into eth1-fallbacks

8de8e6c

# Conflicts: # beacon_node/eth1/src/service.rs

add metrics for failed and total eth1 requests

ce861ff

blacktemplar added the work-in-progress PR is a work-in-progress label Nov 18, 2020

blacktemplar added 9 commits November 19, 2020 10:57

restructure eth1 service + fallbacks work now if an eth1 endpoint is …

1c45ba5

…not synced

cargo fmt

5e27238

Merge branch 'master' into eth1-fallbacks

e1fca12

# Conflicts: # beacon_node/src/cli.rs

cleanup

0d7e1c1

fix tests

becbb27

add unit tests

4715579

fix simulator

292dfbc

fix tests

285e095

invalidate eth1 endpoints if they are far behind

e442376

blacktemplar added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Nov 20, 2020

blacktemplar requested a review from pawanjay176 November 20, 2020 11:13

paulhauner reviewed Nov 21, 2020

View reviewed changes

use struct based approach instead of macros

c43da46

paulhauner changed the base branch from master to unstable November 23, 2020 23:20

Merge branch 'unstable' into eth1-fallbacks

0ba754f

# Conflicts: # beacon_node/eth1/src/service.rs # lcli/src/eth1_genesis.rs

paulhauner approved these changes Nov 27, 2020

View reviewed changes

paulhauner added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Nov 27, 2020

jerryji mentioned this pull request Nov 27, 2020

Backup ETH1 Provider Support OffchainLabs/prysm#7976

Closed

Merge branch 'unstable' into eth1-fallbacks

93fa639

# Conflicts: # beacon_node/eth1/src/service.rs

AgeManning approved these changes Nov 27, 2020

View reviewed changes

AgeManning added ready-for-merge This PR is ready to merge. and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Nov 27, 2020

paulhauner approved these changes Nov 27, 2020

View reviewed changes

bors bot changed the title ~~Fallback nodes for eth1 access~~ [Merged by Bors] - Fallback nodes for eth1 access Nov 27, 2020

bors bot closed this Nov 27, 2020

[Merged by Bors] - Fallback nodes for eth1 access #1918

[Merged by Bors] - Fallback nodes for eth1 access #1918

Uh oh!

Conversation

blacktemplar commented Nov 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue Addressed

Proposed Changes

Additional Info

Uh oh!

pawanjay176 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

blacktemplar commented Nov 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

blacktemplar commented Nov 18, 2020

Uh oh!

blacktemplar commented Nov 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

blacktemplar Nov 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

paulhauner left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AgeManning commented Nov 27, 2020

Uh oh!

paulhauner left a comment

Choose a reason for hiding this comment

Uh oh!

bors bot commented Nov 27, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

blacktemplar commented Nov 16, 2020 •

edited

Loading

blacktemplar commented Nov 17, 2020 •

edited

Loading

blacktemplar commented Nov 20, 2020 •

edited

Loading

blacktemplar Nov 23, 2020 •

edited

Loading