
Try to reconnect to the eth node if the connection fails. #1093

Merged
merged 3 commits into from Nov 23, 2017

Conversation

@palango
Contributor

palango commented Oct 25, 2017

Fixes #707.

This uses the same approach as #767 but a different timeout strategy. I discussed with Konrad and we think it doesn't make sense to kill the raiden instance. So now raiden tries to reconnect every 3 seconds in the first minute, then every 10s.

One issue I found is that one cannot Ctrl-C stop raiden while it's trying to reconnect. Not sure atm why the events aren't propagated. This is now fixed by adding timeouts to the shutdown sequence.
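For illustration, the retry schedule could look roughly like this (a minimal sketch; `retry_waits` and the reconnect loop are assumptions, not the exact code in this PR):

```python
import gevent


def retry_waits(short_wait=3, long_wait=10, short_phase=60):
    """Yield wait times: 3s during the first minute, then 10s forever."""
    elapsed = 0
    while elapsed < short_phase:
        yield short_wait
        elapsed += short_wait
    while True:
        yield long_wait


def wait_for_node(check_connection):
    """Sleep according to the schedule until the eth node answers again."""
    for wait in retry_waits():
        if check_connection():
            return
        gevent.sleep(wait)
```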

@@ -308,16 +312,27 @@ def stop(self):
        self.alarm.stop_async()
        self.protocol.stop_and_wait()

        timeout_s = 2
Collaborator

Please refrain from unnecessary single letter abbreviations, timeout_seconds would be just fine here.

Contributor

IMO this also should be made into a config in DEFAULT_CONFIG, under a new key ethereum_node or something like it.
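Purely as an illustration of that suggestion (the key name and fields are assumptions, not the actual DEFAULT_CONFIG layout):

```python
DEFAULT_CONFIG = {
    # ... existing keys ...
    'ethereum_node': {
        'shutdown_timeout': 2,      # seconds to wait for RPC calls while shutting down
        'reconnect_wait_short': 3,  # retry interval during the first minute
        'reconnect_wait_long': 10,  # retry interval after that
    },
}
```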

Contributor Author

should we add a command line flag for changing this as well?

        self.event_listeners = list()
        result = list()
        reinstalled_filters = True
        self.add_proxies_listeners(get_relevant_proxies(
Collaborator

Not sure if recreating the proxy instances here is the right approach (i.e. calling get_relevant_proxies). I suspect that this approach duplicates the proxy instances (the old instances may still be referenced from some other node).

Contributor

I think Konrad is correct here. Perhaps try to see where we save the data you need when we install proxies at raiden_service and extract it from there?

Collaborator

Can you try to re-install from self.event_listeners, or, if they don't hold enough information, extend them to allow for re-installation? That way we also won't have the potentially cyclic dependency of self.chain in this class.

Contributor

I think it's fine to reinstall the filters. Filters are not persistent, so once the node is restarted any filter id from the previous run results in a filter not found error; reinstalling the filters seems fine under this circumstance.

@LefterisJP The filters are kept inside the BlockchainEvents only.

NVM, you guys are talking about the proxies not the filters, my bad.

Contributor

There are problems with this patch though.

  1. It may miss events under some race conditions. This will happen if the filters miss a block before being re-installed.
  2. It's passing the chain to yet another object, this may be better handled by something else.

Example for 1.:

- Block 5
- Events are polled
- The node goes offline
- Block 6
- Block 7
- The filters are reinstalled

Under the above scenario, events from block 6 will be lost. Setting [`fromBlock`](https://github.com/ethereum/wiki/wiki/JSON-RPC#eth_newfilter) to the latest polled block should be fine.

Note: Processing the same event multiple times is not a problem, events must be idempotent.
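A sketch of how the filters could be re-installed from the last polled block (the `EventListener` fields and the `reinstall_filters` name are assumptions based on the snippets in this PR, not the final code):

```python
from collections import namedtuple

# Assumed shape; the real EventListener may hold different fields.
EventListener = namedtuple(
    'EventListener',
    ('event_name', 'filter_', 'filter_creation_function'),
)


def reinstall_filters(event_listeners, from_block):
    """Recreate the filters after a reconnect, starting from the last polled
    block so no events are missed; processing an event twice is fine since
    events must be idempotent."""
    return [
        EventListener(
            listener.event_name,
            listener.filter_creation_function(from_block=from_block),
            listener.filter_creation_function,
        )
        for listener in event_listeners
    ]
```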

Contributor Author

Any recommendations on how to handle point 2? I'm not sure how to architect this in a nice way.

@@ -438,6 +439,31 @@ class JSONRPCClient(object):
        nonce_offset (int): Network's default base nonce number.
    """

    def _checkNodeConnection(func, *args, **kwargs):
        def retry_waits():
Contributor

I think this should reuse the timeout_exponential_backoff generator (better move it outside that module)
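For reference, the decorator could be shaped roughly like this (a sketch; it reuses a wait-time generator such as the `retry_waits` sketched above or an exponential backoff, and only lists the exception type discussed in this PR):

```python
import functools

import gevent
import requests


def check_node_connection(func):
    """Keep retrying the wrapped RPC call until the eth node answers again."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        waits = retry_waits()  # or a timeout_exponential_backoff-style generator
        while True:
            try:
                return func(self, *args, **kwargs)
            except requests.exceptions.ConnectionError:
                gevent.sleep(next(waits))
    return wrapper
```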

@@ -726,6 +752,7 @@ def filter_changes(self, fid):
            for c in changes
        ]

    @_checkNodeConnection
Contributor

what about the transactions?

Contributor Author

Good point!

Contributor

Btw, I wonder if the transaction pool is actually persisted; if that is the case the retries may fail with a known transaction error: #1061

Contributor Author

Apparently it is persistent for one's own transactions.

Contributor

So, it raises the exception with the message known transaction? We need some way to tell if that's bad or good, and handle errors like the one in the PR.

If you want we can merge this and look at the repeated transaction later.

@LefterisJP
Contributor

@palango you will need to rebase on top of the changes that split the proxies and filters into their own files.

@palango
Contributor Author

palango commented Nov 14, 2017

Yeah, didn't want to rebase before I knew how stuff would work.

@palango
Contributor Author

palango commented Nov 15, 2017

Ok, think this is ready for a second round of reviews. Thanks for the good feedback.

  • Now the proxies are reused and just the filters updated after a reconnect.
  • The problem of missing events is handled by giving the filters a from_block parameter set to the last seen block number.
  • I still use the two-stage timeout generator, it feels a bit nicer. But I'm open to change if there's other sentiment.
  • I added a setting for the shutdown timeout. Should this be given a command line option as well?

        # contact the disconnected client
        try:
            with gevent.Timeout(self.shutdown_timeout):
                self.blockchain_events.uninstall_all_event_listeners()
Contributor

If the node is offline, the filters are already gone, I don't think there is much use for waiting for it to come back online.

Contributor Author

The uninstall_all_event_listeners function calls uninstall on every Filter, which results in an RPC call that will block.

Contributor

Yes, the point is that the RPC-call is useless if the ethereum node is offline. Ideally the code would just give up on uninstalling if the ethereum node is offline, instead of waiting for the timeout.

Note: I'm assuming that raiden and the ethereum node are running on the same machine; waiting for the timeout actually makes sense for communication through the network, since we can't know for sure whether the node is offline or the network is hiccuping.

Contributor Author

I agree with the reasoning but I'm not sure how to handle that. Might be easiest to add a connected property to the JSONRPCClient and set it accordingly in the _checkNodeConnection decorator. Do you think that would work?

Regarding your assumption: I'd say it's a valid assumption to have ethereum node and raiden on the same machine. Did we discuss that?
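A rough sketch of the `connected` flag idea, extending the retry decorator sketched earlier (the attribute name, the decorator shape, and the `self.chain.client` path are assumptions):

```python
import gevent
import requests


def check_node_connection(func):
    """Retry on connection errors and remember whether the node is reachable."""
    def wrapper(self, *args, **kwargs):
        while True:
            try:
                result = func(self, *args, **kwargs)
                self.connected = True   # the node answered again
                return result
            except requests.exceptions.ConnectionError:
                self.connected = False  # remember that the node is unreachable
                gevent.sleep(3)
    return wrapper


# In stop(), the uninstall could then be skipped when the node is known to be down:
#     if self.chain.client.connected:
#         with gevent.Timeout(self.shutdown_timeout):
#             self.blockchain_events.uninstall_all_event_listeners()
```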

Contributor

> Do you think that would work?

Yes, although it's a bit convoluted. Can't you expose a version of the rpc client that does not retry by default?

> Did we discuss that?

I don't recall this being made an explicit assumption.

Contributor Author

The problem with exposing different kinds of client is that most of the interaction happens over the BlockchainService object. Another option would be to introduce a per-method retry parameter. However I think that complicates stuff more than using the timeout in the end.

        for event_listener in self.event_listeners:
            new_listener = EventListener(
                event_listener.event_name,
                event_listener.filter_creation_function(from_block=from_block),
Contributor

Is the from_block inclusive?

I believe the value of self._blocknumber from RaidenService is the current block, which is being polled for events.

Note: This actually depends on the order in which the callbacks are installed with the alarm task; if the polling is executed first then self._blocknumber is the previous block.

Contributor Author

As of this it is inclusive, not sure about Parity though.

Contributor

Btw, could you add a test for this under the assumptions?

                log.info('Client reconnected')
                return result

            except (requests.exceptions.ConnectionError, InvalidReplyError):
Contributor

Why is the client considered offline when an InvalidReplyError is raised?

I'm wondering what is the content of the response, perhaps the node restarted and Raiden polled for a filter that is gone?

Contributor Author

Just after the disconnect, tinyrpc gets an empty response which leads to the InvalidReplyError; in my testing this only happens in that case.

Not sure why this happens.

@palango
Contributor Author

palango commented Nov 23, 2017

Ok, next version. I removed the decorator from send_transaction; it uses call on all paths, so it should be covered.

It would be nice to merge this now and fix some of the issues you brought up later. These are:

I think these are important but can be handled after this PR. Opinions?

@palango palango merged commit 3cb5d52 into raiden-network:master Nov 23, 2017
@palango palango deleted the reconnect branch November 23, 2017 13:36
rakanalh added a commit that referenced this pull request Jun 18, 2018
Introduced 2 exceptions for both SocketFactory and RaidenAPI
servers. Each of those exceptions are raised when their designated
server's port is already in use. The CLI component in this case would
need to catch those specific exceptions and display error messages
accordingly.

Resolves #1093
@rakanalh rakanalh mentioned this pull request Jun 18, 2018

Successfully merging this pull request may close these issues.

Handle eth node restarts