Skip to content

Conversation

@blacktemplar
Copy link
Contributor

@blacktemplar blacktemplar commented Nov 16, 2020

Issue Addressed

part of #1883

Proposed Changes

Adds a new cli argument --eth1-endpoints that can be used instead of --eth1-endpoint to specify a comma-separated list of endpoints. If the first endpoint returns an error for some request the other endpoints are tried in the given order.

Additional Info

Currently if the first endpoint fails the fallbacks are used silently (except for try_fallback_test_endpoint that is used in do_update which logs a WARN for each endpoint that is not reachable). A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.

Copy link
Member

@pawanjay176 pawanjay176 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The macro is pretty neat!
Just a few minor nits.

A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.

Agree we should add extra logs.
Another nice to have imo would be to add metrics for num_correct_responses/num_requests for each passed endpoint. That way the user can set alerts for failing endpoints.

.help("Specifies the server for a web3 connection to the Eth1 chain. Also enables the --eth1 flag. Defaults to http://127.0.0.1:8545.")
.takes_value(true)
)
.arg(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove eth1-endpoint and just keep eth1-endpoints since eth1-endpoints since endpoints is applicable for both single and multiple endpoints.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't remove eth1-endpoint to stay backwards-compatible, but we can remove that in the next major release...

@blacktemplar
Copy link
Contributor Author

blacktemplar commented Nov 17, 2020

LGTM. The macro is pretty neat!
Just a few minor nits.

A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.

Agree we should add extra logs.
Another nice to have imo would be to add metrics for num_correct_responses/num_requests for each passed endpoint. That way the user can set alerts for failing endpoints.

Currently the eth1::http module is quite pure and does not consider logging or metrics, this is the reason why I didn't add it yet. I am not sure if more logging is useful anyway since do_update is called regularly and if an endpoint is down for longer time this will result in warning logs. The metrics would be more interesting I think and shouldn't be a problem to add since its realized via globals anyway (so we don't need to pass anything into the macro or something).

@blacktemplar
Copy link
Contributor Author

I just realized a problem with the current approach: In the update we check the network id and chain id currently with try fallback. If the first endpoint uses the wrong network id or chain id and the second the correct one then the check will succeed with a WARN log line. But for all the following requests the first endpoint gets called first which may give wrong results because of the wrong network id or chain id.

To fix this we have three possibilities:

  1. Use the fallback_on_err macro over the whole do_update function. This has the drawback that if we have to download a lot of eth1 blocks in the update and lets say one of the last requests fails for some reason, then we redo the whole update and redownload all the blocks with the next fallback endpoint.
  2. Do some sort of lazy testing of the endpoints. This is more work to realize but the idea is that during the whole do_update function we remember for each endpoint if it got tested already or not. Whenever we access an endpoint within this function call tree we check if it got already tested and if not we test it (i.e. check if it is reachable and if network id and chain id are correct).
  3. At the beginning of do_update check all endpoints and remove those that are not reachable or have the wrong network id or chain id. This is easier than 2. but may result in unnecessary requests to the fallbacks when they are not used during the update.

I am currently preparing 2. since I think its the best solution but if you disagree please tell me ;).

@blacktemplar blacktemplar added the work-in-progress PR is a work-in-progress label Nov 18, 2020
@blacktemplar
Copy link
Contributor Author

blacktemplar commented Nov 20, 2020

This is now ready for review. A few notes:

  • I had to refactor a few parts of the eth1 service so that requests are grouped together before using fallbacks. This is important since if we have a request A and a following request B and we only know if A was successful after processing B we should group them together and process both with one endpoint and if one of both is not successful fallback. On the other hand we only want to do the minimal necessary grouping since otherwise we may redo a lot of unnecessary requests when falling back.
  • The endpoint's network_id and chain_id is now always checked once for each used endpoint, when calling any of the methods that expects an endpoint list. This implies semantic changes for eth1_genesis_service but I think it is also good to check the network_id and chain_id there.
  • I introduced now a simple mechanism to detect that a node fall behind and we should switch to a fallback node. At the beginning of the update method we download the latest block from the endpoint (this didn't change in this PR). We can then use this block to check if its timestamp is far in the past. To specify "far" I added a config value node_far_behind_seconds which means if a timestamp of the latest block of an endpoint is in the past more than node_far_behind_seconds seconds we consider that endpoint to be out of sync and fallback. This config value is currently initialized with max(5, spec.eth1_follow_distance / 2) * spec.seconds_per_eth1_block (note that the max is only for degenerated specs).

@blacktemplar blacktemplar added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Nov 20, 2020
/// endpoint. If no endpoint is usable it returns None. Usability of endpoints is checked lazily
/// using the EndpointsCache structure. For each endpoint if it returns an error the on_err function
/// is called.
macro_rules! with_fallback_and_on_err {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The macro is cool, but I'm not convinced that we need to use a macro here. Generally, I find not writing a macro is the more maintainable and extensible solution; compile errors are simpler, imports/exports are easier and the types/generics are more clearly expressed. Of course, there are times where macros are needed or are simpler.

This is how I would go about this problem without templating/macros: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=564a7fc931b10cfc757f752afe261604

Considering that we're going to need this same fallback logic in the VC, I would suggest to use a struct-based approach (like in the the play above) defined in a new crate in common/ that can be shared between validator_client/ and beacon_node/eth1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can even include the validator client fallback in this PR, if you like like :)

Copy link
Contributor Author

@blacktemplar blacktemplar Nov 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the detailed review. I wanted to go a similar route at the beginning but then probably abandoned the idea too early after having problems integrating everything in an async way. I will have a try at that again...

@paulhauner paulhauner changed the base branch from master to unstable November 23, 2020 23:20
# Conflicts:
#	beacon_node/eth1/src/service.rs
#	lcli/src/eth1_genesis.rs
Copy link
Member

@paulhauner paulhauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, apart from the unfortunate merge conflict :)

Err(FallbackError::AllErrored(errors))
}

pub fn map_format_error<'a, E, F, S>(&'a self, f: F, error: &FallbackError<E>) -> String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a little while to understand what this doing, but it looks like it does it faithfully.

An alternate approach could be based upon this rough sketch:

impl<T: Display> Fallback<T> {
  pub fn format_error(&self, error: &FallbackError<E>) -> String {
    ..
  }
}

impl Display for EndpointWithState {
  ..
}

I won't block on this though, it works well as is :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do that yet, since I will change that part anyway for restructuring for the bn case.

.await
.map_err(error_connecting)?;
if &network_id != config_network_id {
warn!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in conflict with #1981 unfortunately. I think what we need to do is return some error if we get chain_id ==0 so that we try again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am returning now an EndpintError::FarBehind error + logging a warning.

@paulhauner paulhauner added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Nov 27, 2020
# Conflicts:
#	beacon_node/eth1/src/service.rs
@AgeManning
Copy link
Member

bors r+

@AgeManning AgeManning added ready-for-merge This PR is ready to merge. and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Nov 27, 2020
bors bot pushed a commit that referenced this pull request Nov 27, 2020
## Issue Addressed

part of  #1883

## Proposed Changes

Adds a new cli argument `--eth1-endpoints` that can be used instead of `--eth1-endpoint` to specify a comma-separated list of endpoints. If the first endpoint returns an error for some request the other endpoints are tried in the given order.

## Additional Info

Currently if the first endpoint fails the fallbacks are used silently (except for `try_fallback_test_endpoint` that is used in `do_update` which logs a `WARN` for each endpoint that is not reachable). A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.
Copy link
Member

@paulhauner paulhauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @blacktemplar. I think users will really appreciate this one.

I'll publish a release in the coming hours :)

@bors
Copy link

bors bot commented Nov 27, 2020

@bors bors bot changed the title Fallback nodes for eth1 access [Merged by Bors] - Fallback nodes for eth1 access Nov 27, 2020
@bors bors bot closed this Nov 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-merge This PR is ready to merge.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants