Try to reconnect to the eth node if the connection fails. #1093

Merged
palango merged 3 commits into raiden-network:master from palango:reconnect on Nov 23, 2017

Conversation

@palango
Collaborator

palango commented Oct 25, 2017

Fixes #707.

This uses the same approach as #767 but a different timeout strategy. I discussed this with Konrad and we think it doesn't make sense to kill the raiden instance. So now raiden tries to reconnect every 3 seconds during the first minute, then every 10 seconds.

One issue I found is that one cannot stop raiden with Ctrl-C while it's trying to reconnect; not sure atm why the events aren't propagated. This is now fixed by adding timeouts to the shutdown sequence.
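A minimal sketch of such a two-stage retry schedule (an illustration with made-up names, not the PR's actual code):

```python
def retry_waits(first_wait=3, later_wait=10, switch_after=60):
    """Yield wait times: 3 seconds during the first minute, 10 seconds after."""
    elapsed = 0
    while True:
        wait = first_wait if elapsed < switch_after else later_wait
        yield wait
        elapsed += wait
```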

@palango palango force-pushed the palango:reconnect branch from cfd538a to a5f3974 Oct 25, 2017

@palango palango force-pushed the palango:reconnect branch 4 times, most recently from 1187e74 to fa86d63 Oct 25, 2017

@palango palango force-pushed the palango:reconnect branch from fa86d63 to 13b2536 Nov 13, 2017

@@ -308,16 +312,27 @@ def stop(self):
self.alarm.stop_async()
self.protocol.stop_and_wait()
timeout_s = 2

@konradkonrad

konradkonrad Nov 13, 2017

Collaborator

Please refrain from unnecessary single letter abbreviations, timeout_seconds would be just fine here.

@hackaugusto

hackaugusto Nov 13, 2017

Collaborator

IMO this should also be made into a config entry in DEFAULT_CONFIG, under a new key ethereum_node or something like it.

@palango

palango Nov 13, 2017

Collaborator

Should we add a command line flag for changing this as well?

self.event_listeners = list()
result = list()
reinstalled_filters = True
self.add_proxies_listeners(get_relevant_proxies(

@konradkonrad

konradkonrad Nov 13, 2017

Collaborator

Not sure if recreating the proxy instances here is the right approach (i.e. calling get_relevant_proxies). I suspect that we double the proxy instances with this approach (i.e. the old instances may still be referenced from some other node).

@LefterisJP

LefterisJP Nov 13, 2017

Collaborator

I think Konrad is correct here. Perhaps look at where we save the data you need when we install proxies in raiden_service and extract it from there?

@konradkonrad

konradkonrad Nov 13, 2017

Collaborator

Can you try to re-install from self.event_listeners, or, if they don't hold enough information, extend them to allow for re-installation? That way we also won't have the potentially cyclic dependency of self.chain in this class.
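For illustration, extending the listeners to allow re-installation could look roughly like this (a hypothetical structure; the PR later does something similar with a filter_creation_function field):

```python
from collections import namedtuple

# Hypothetical: keep the callable that created the filter, so the
# listener can be re-installed after a reconnect.
EventListener = namedtuple(
    'EventListener',
    ('event_name', 'filter', 'filter_creation_function'),
)

def reinstall_listeners(listeners, from_block):
    # Recreate every filter, starting from the last polled block.
    return [
        EventListener(
            listener.event_name,
            listener.filter_creation_function(from_block=from_block),
            listener.filter_creation_function,
        )
        for listener in listeners
    ]
```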

@hackaugusto

hackaugusto Nov 13, 2017

Collaborator

I think it's fine to reinstall the filters. Filters are not persistent: once the node is restarted, using a filter id from the previous run yields a filter not found error, so reinstalling the filters seems fine under this circumstance.

@LefterisJP The filters are kept inside the BlockchainEvents only.

NVM, you guys are talking about the proxies, not the filters, my bad.

@hackaugusto

hackaugusto Nov 13, 2017

Collaborator

There are problems with this patch though.

  1. It may miss events under some race conditions. This will happen if the filters miss a block before being re-installed.
  2. It's passing the chain to yet another object; this may be better handled by something else.

Example for 1.:

- Block 5
- Events are polled
- The node goes offline
- Block 6
- Block 7
- The filters are reinstalled

Under the above scenario, events from block 6 will be lost. Setting [`fromBlock`](https://github.com/ethereum/wiki/wiki/JSON-RPC#eth_newfilter) to the latest polled block should be fine.

Note: Processing the same event multiple times is not a problem; event handling must be idempotent.
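For illustration, reinstalling a filter with fromBlock set to the last polled block might look like this (the payload follows the JSON-RPC spec linked above; client is a hypothetical transport):

```python
# Re-create the filter starting from the last polled block so events
# mined while the node was offline are picked up again. Re-delivering
# already-seen events is safe because event handling is idempotent.
filter_params = {
    'fromBlock': hex(last_polled_block),  # inclusive on geth
    'toBlock': 'latest',
    'address': contract_address,
}
filter_id = client.call('eth_newFilter', filter_params)
```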

@palango

palango Nov 14, 2017

Collaborator

Any recommendations on how to handle point 2? I'm not sure how to architect this in a nice way.

@@ -438,6 +439,31 @@ class JSONRPCClient(object):
nonce_offset (int): Network's default base nonce number.
"""
def _checkNodeConnection(func, *args, **kwargs):
def retry_waits():

@hackaugusto

hackaugusto Nov 13, 2017

Collaborator

I think this should reuse the timeout_exponential_backoff generator (better to move it out of that module).
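For reference, a generic exponential-backoff generator along those lines might look like this (an illustrative sketch, not raiden's actual timeout_exponential_backoff):

```python
def timeout_exponential_backoff(initial=3.0, maximum=60.0, factor=2.0):
    """Yield wait times that grow geometrically, capped at `maximum`."""
    timeout = initial
    while True:
        yield timeout
        timeout = min(timeout * factor, maximum)
```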

@@ -726,6 +752,7 @@ def filter_changes(self, fid):
for c in changes
]
@_checkNodeConnection

@hackaugusto

hackaugusto Nov 13, 2017

Collaborator

What about the transactions?

@palango

palango Nov 14, 2017

Collaborator

Good point!

@hackaugusto

hackaugusto Nov 16, 2017

Collaborator

Btw, I wonder if the transaction pool is actually persisted; if that is the case, the retries may fail with a known transaction error: #1061

@palango

palango Nov 23, 2017

Collaborator

Apparently it is persistent for one's own transactions.

@hackaugusto

hackaugusto Nov 23, 2017

Collaborator

So it raises the exception with the message known transaction? We need some way to tell whether that's bad or good, and handle errors like the one in the PR.

If you want we can merge this and look at the repeated transaction later.

@palango palango referenced this pull request Nov 13, 2017

Closed

Handle eth node disconnect #767

@palango palango force-pushed the palango:reconnect branch from 2ede5d0 to 554d016 Nov 14, 2017

@LefterisJP

Collaborator

LefterisJP commented Nov 14, 2017

@palango you will need to rebase on top of the changes that split the proxies and filters into their own files.

@palango

Collaborator

palango commented Nov 14, 2017

Yeah, didn't want to rebase before I knew how stuff would work.

@palango palango force-pushed the palango:reconnect branch from 0867585 to a3c78a3 Nov 15, 2017

@palango palango referenced this pull request Nov 15, 2017

Closed

Reconnect if ethereum node goes offline #1128

4 of 4 tasks complete

@palango palango force-pushed the palango:reconnect branch 2 times, most recently from 26523dd to 36ea8f5 Nov 15, 2017

@palango

Collaborator

palango commented Nov 15, 2017

Ok, I think this is ready for a second round of reviews. Thanks for the good feedback.

  • Now the proxies are reused and just the filters are updated after a reconnect.
  • The problem of missing events is handled by giving the filters a from_block parameter set to the last seen block number.
  • I still use the two-stage timeout generator; it feels nicer to me, but I'm open to changing it if there's other sentiment.
  • I added a setting for the shutdown timeout. Should this be given a command line option as well?
# contact the disconnected client
try:
with gevent.Timeout(self.shutdown_timeout):
self.blockchain_events.uninstall_all_event_listeners()

@hackaugusto

hackaugusto Nov 15, 2017

Collaborator

If the node is offline, the filters are already gone; I don't think there is much use in waiting for it to come back online.

@palango

palango Nov 15, 2017

Collaborator

The uninstall_all_event_listeners function calls uninstall on every Filter, which results in an RPC call that will block.

@hackaugusto

hackaugusto Nov 15, 2017

Collaborator

Yes, the point is that the RPC call is useless if the ethereum node is offline. Ideally the code would just give up on uninstalling if the ethereum node is offline, instead of waiting for the timeout.

Note: I'm assuming that raiden and the ethereum node are running on the same machine. Waiting actually makes sense for communication over the network, since we can't know for sure whether the node is offline or the network is hiccuping.

@palango

palango Nov 15, 2017

Collaborator

I agree with the reasoning, but I'm not sure how to handle that. It might be easiest to add a connected property to the JSONRPCClient and set it accordingly in the _checkNodeConnection decorator. Do you think that would work?

Regarding your assumption: I'd say it's a valid assumption to have the ethereum node and raiden on the same machine. Did we discuss that?

@hackaugusto

hackaugusto Nov 15, 2017

Collaborator

Do you think that would work?

Yes, although it's a bit convoluted; can't you expose a version of the RPC client that does not retry by default?

Did we discuss that?

I don't recall this being made an explicit assumption.

@palango

palango Nov 23, 2017

Collaborator

The problem with exposing different kinds of clients is that most of the interaction happens through the BlockchainService object. Another option would be to introduce a per-method retry parameter. However, I think that complicates things more than using the timeout in the end.
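A minimal sketch of the connected-flag idea discussed above (hypothetical names, assuming a two-stage retry schedule like the one in this PR; not the code that was merged):

```python
from functools import wraps
from itertools import chain, repeat

import gevent
import requests

def check_node_connection(func):
    """Retry the wrapped RPC method while the node is unreachable."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        # 3s waits for the first minute (20 attempts), then 10s waits.
        waits = chain(repeat(3, 20), repeat(10))
        while True:
            try:
                result = func(self, *args, **kwargs)
                self.connected = True  # shutdown code can consult this flag
                return result
            except requests.exceptions.ConnectionError:
                # While offline, flip the flag so e.g. filter uninstalls
                # during shutdown can be skipped instead of blocking.
                self.connected = False
                gevent.sleep(next(waits))
    return wrapper
```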

for event_listener in self.event_listeners:
new_listener = EventListener(
event_listener.event_name,
event_listener.filter_creation_function(from_block=from_block),

@hackaugusto

hackaugusto Nov 15, 2017

Collaborator

Is the from_block inclusive?

I believe the value of self._blocknumber from RaidenService is the current block, which is being polled for events.

Note: This actually depends on the order in which the callbacks are installed with the alarm task; if the polling is executed first, then self._blocknumber is the previous block.

@palango

palango Nov 15, 2017

Collaborator

As of this it is inclusive; not sure about Parity though.

@hackaugusto

hackaugusto Nov 15, 2017

Collaborator

Btw, could you add a test for these assumptions?
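Such a test could be sketched roughly like this (fixture and helper names are made up for illustration):

```python
def test_from_block_is_inclusive(client, contract):
    # Emit an event and note the block it was mined in (helper names
    # are hypothetical).
    receipt = contract.emit_test_event()
    event_block = receipt['blockNumber']

    # A filter created with from_block equal to that very block must
    # still report the event, i.e. from_block is inclusive.
    test_filter = client.new_filter(contract.address, from_block=event_block)
    events = test_filter.changes()
    assert any(event['block_number'] == event_block for event in events)
```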

log.info('Client reconnected')
return result
except (requests.exceptions.ConnectionError, InvalidReplyError):

@hackaugusto

hackaugusto Nov 15, 2017

Collaborator

Why is the client considered offline when an InvalidReplyError is raised?

I'm wondering what the content of the response is; perhaps the node restarted and Raiden polled for a filter that is gone?

@palango

palango Nov 15, 2017

Collaborator

Just after the disconnect, tinyrpc gets an empty response, which leads to the InvalidReplyError; in my testing this only happens in this case.

Not sure why this happens.

@palango palango force-pushed the palango:reconnect branch 2 times, most recently from d102032 to ad5e52b Nov 22, 2017

palango added some commits Nov 15, 2017

@palango palango force-pushed the palango:reconnect branch from ad5e52b to 0c69924 Nov 23, 2017

@palango

Collaborator

palango commented Nov 23, 2017

Ok, next version. I removed the decorator from send_transaction; it uses call on all paths, so it should be covered.

It would be nice to merge this now and fix some of the issues you brought up later. These are:

  • Handle the known transaction case gracefully #1061
  • Get rid of the timeout in the shutdown sequence #1149
  • Decide if we want a command line setting or environment variable for the timeouts

I think these are important but can be handled after this PR. Opinions?

@palango palango merged commit 3cb5d52 into raiden-network:master Nov 23, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@palango palango deleted the palango:reconnect branch Nov 23, 2017

rakanalh added a commit that referenced this pull request Jun 18, 2018

Resolves address in use:
Introduced 2 exceptions for both SocketFactory and RaidenAPI
servers. Each of those exceptions is raised when their designated
server's port is already in use. The CLI component in this case would
need to catch those specific exceptions and display error messages
accordingly.

Resolves #1093

@rakanalh rakanalh referenced this pull request Jun 18, 2018

Merged

Resolve address in use #1582
