Try to reconnect to the eth node if the connection fails. #1093
Conversation
raiden/raiden_service.py
@@ -308,16 +312,27 @@ def stop(self):
    self.alarm.stop_async()
    self.protocol.stop_and_wait()

    timeout_s = 2
Please refrain from unnecessary single-letter abbreviations; `timeout_seconds` would be just fine here.
IMO this also should be made into a config option in `DEFAULT_CONFIG`, under a new key `ethereum_node` or something like it.
should we add a command line flag for changing this as well?
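For reference, a minimal sketch of what such a config could look like. The `ethereum_node` key comes from the suggestion above, but the nesting, the option name, and the accessor are purely illustrative, not the actual Raiden config:

```python
# Hypothetical sketch: a nested config section for eth-node connection
# settings, as suggested in the review. All names are illustrative only.
DEFAULT_CONFIG = {
    'ethereum_node': {
        # seconds to wait between reconnection attempts
        'reconnect_timeout_seconds': 2,
    },
}


def get_reconnect_timeout(config):
    """Read the timeout, falling back to the default if it is missing."""
    return config.get('ethereum_node', {}).get(
        'reconnect_timeout_seconds',
        DEFAULT_CONFIG['ethereum_node']['reconnect_timeout_seconds'],
    )
```

A command line flag, as asked above, would then only need to override this one key before the service starts.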
raiden/blockchain/events.py
self.event_listeners = list()
result = list()
reinstalled_filters = True
self.add_proxies_listeners(get_relevant_proxies(
Not sure if recreating the proxy instances here is the right approach (i.e. calling `get_relevant_proxies`). I suspect that we double the proxy instances with this approach (i.e. the old instances may still be referenced from some other node).
I think Konrad is correct here. Perhaps try to see where we save the data you need when we install proxies at raiden_service and extract it from there?
Can you try to re-install from `self.event_listeners`, or, if they don't hold enough information, extend them to allow for re-installation? That way we also won't have the potentially cyclic dependency of `self.chain` in this class.
I think it's fine to reinstall the filters. Since filters are not persistent, once the node is restarted and the id from the previous run is used, the error will be `filter not found`; reinstalling the filters seems fine under this circumstance.
@LefterisJP The filters are kept inside the `BlockchainEvents` only.
NVM, you guys are talking about the proxies not the filters, my bad.
There are problems with this patch though:
1. It may miss events under some race conditions. This will happen if the filters miss a block before being re-installed.
2. It's passing the `chain` to yet another object; this may be better handled by something else.
Example for 1.:
- Block 5
- Events are polled
- The node goes offline
- Block 6
- Block 7
- The filters are reinstalled
Under the above scenario, events from block 6 will be lost. Setting [`fromBlock`](https://github.com/ethereum/wiki/wiki/JSON-RPC#eth_newfilter) to the latest polled block should be fine.
Note: Processing the same event multiple times is not a problem, events must be idempotent.
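A rough sketch of the fix for point 1: remember the last polled block and pass it as `fromBlock` when reinstalling. The listener/filter classes below are simplified stand-ins for illustration, not the actual Raiden types:

```python
class FilterStub:
    """Stand-in for an eth_newFilter-backed filter (illustrative only)."""

    def __init__(self, from_block):
        self.from_block = from_block


def reinstall_filters(event_listeners, last_polled_block):
    """Recreate every filter starting at the last block that was polled.

    Because fromBlock is inclusive, events from last_polled_block may be
    delivered a second time; per the note above that is acceptable, since
    event processing must be idempotent.
    """
    return [
        (event_name, FilterStub(from_block=last_polled_block))
        for event_name, _old_filter in event_listeners
    ]
```

In the race described above, the node would reinstall with `last_polled_block = 5`, so the events from blocks 5, 6, and 7 are all delivered again instead of block 6 being silently skipped.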
Any recommendations on how to handle point 2? I'm not sure how to architect this in a nice way.
raiden/network/rpc/client.py
@@ -438,6 +439,31 @@ class JSONRPCClient(object):
        nonce_offset (int): Network's default base nonce number.
    """

    def _checkNodeConnection(func, *args, **kwargs):
        def retry_waits():
I think this should reuse the timeout_exponential_backoff generator (better move it outside that module)
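A sketch of that suggestion: a retry decorator driven by a backoff generator. The `timeout_exponential_backoff` signature shown here is an assumption (the thread only names the helper), and the caught exception is simplified to the builtin `ConnectionError`:

```python
import functools
import time


def timeout_exponential_backoff(initial=3.0, maximum=10.0, factor=2.0):
    """Yield successive wait times, growing up to a cap.

    Assumed interface; the real helper mentioned in the review may differ.
    """
    timeout = initial
    while True:
        yield timeout
        timeout = min(timeout * factor, maximum)


def check_node_connection(func):
    """Retry the wrapped RPC call until it stops raising ConnectionError."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        waits = timeout_exponential_backoff()
        while True:
            try:
                return func(*args, **kwargs)
            except ConnectionError:
                time.sleep(next(waits))
    return wrapper
```

Moving the generator out of its current module, as suggested, would let both the protocol retries and this decorator share one backoff policy.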
raiden/network/rpc/client.py
@@ -726,6 +752,7 @@ def filter_changes(self, fid):
            for c in changes
        ]

    @_checkNodeConnection
what about the transactions?
Good point!
Btw, I wonder if the transaction pool is actually persisted; if that is the case, the retries may fail with a `known transaction` error: #1061
Apparently it is persisted for one's own transactions.
So, it raises the exception with the message `known transaction`? We need some way to tell if that's bad or good, and handle errors like the one in the PR.
If you want we can merge this and look at the repeated transaction later.
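One way to tell the two cases apart would be to inspect the error message before giving up. This is only a sketch: the exception type is a stand-in, and the exact `known transaction` text is an assumption based on geth's usual wording, not something confirmed in this thread:

```python
class EthNodeCommunicationError(Exception):
    """Stand-in for the RPC error raised when the node rejects a request."""


def send_transaction_idempotent(send, *args, **kwargs):
    """Call `send`, treating a 'known transaction' rejection as success.

    If the node persists its transaction pool, a retry after reconnecting
    can be rejected even though the original send went through, so that
    specific error is swallowed instead of propagated.
    """
    try:
        return send(*args, **kwargs)
    except EthNodeCommunicationError as err:
        if 'known transaction' in str(err).lower():
            return None  # already in the pool, nothing left to do
        raise  # any other rejection is a real error
```

Whether swallowing the error is "good" here depends on the retry semantics discussed above; the point is only that the benign duplicate can be distinguished from, say, a nonce error.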
@palango you will need to rebase on top of the changes that split the proxies and filters into their own files.
Yeah, didn't want to rebase before I knew how stuff would work.
Ok, think this is ready for a second round of reviews. Thanks for the good feedback.
# contact the disconnected client
try:
    with gevent.Timeout(self.shutdown_timeout):
        self.blockchain_events.uninstall_all_event_listeners()
If the node is offline the filters are already gone; I don't think there is much use in waiting for it to come back online.
The `uninstall_all_event_listeners` function calls `uninstall` on every `Filter`, which results in an RPC call that will block.
Yes, the point is that the RPC call is useless if the ethereum node is offline. Ideally the code would just give up on uninstalling if the ethereum node is offline, instead of waiting for the timeout.
Note: I'm assuming that raiden and the ethereum node are running on the same machine. The waiting does make sense for communication over the network, since in that case we can't know for sure whether the node is offline or the network is hiccuping.
I agree with the reasoning but I'm not sure how to handle that. Might be easiest to add a `connected` property to the `JSONRPCClient` and set it accordingly in the `_checkNodeConnection` decorator. Do you think that would work?
Regarding your assumption: I'd say it's a valid assumption to have the ethereum node and raiden on the same machine. Did we discuss that?
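A sketch of the `connected`-flag idea: the retry path records whether the last RPC call succeeded, and the shutdown path skips the blocking uninstall RPCs when the node is known to be gone. Class and function names here are illustrative stand-ins, not the real `JSONRPCClient` API:

```python
class ReconnectingClient:
    """Illustrative client that tracks whether the eth node is reachable."""

    def __init__(self):
        self.connected = True

    def call(self, func, *args, **kwargs):
        """Run one RPC call, updating the connectivity flag as we go."""
        try:
            result = func(*args, **kwargs)
            self.connected = True
            return result
        except ConnectionError:
            self.connected = False
            raise


def uninstall_filters(client, filter_ids, uninstall_rpc):
    """Skip the uninstall RPCs when the node is known to be offline.

    An offline or restarted node has already dropped its filters, so
    there is nothing to uninstall and no reason to block on the timeout.
    """
    if not client.connected:
        return
    for fid in filter_ids:
        client.call(uninstall_rpc, fid)
```

This keeps the shutdown sequence fast in the offline case while leaving the normal path unchanged.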
> Do you think that would work?

Yes, although it's a bit convoluted; can't you expose a version of the rpc client that does not retry by default?

> Did we discuss that?

I don't recall this being made an explicit assumption.
The problem with exposing different kinds of clients is that most of the interaction happens over the `BlockchainService` object. Another option would be to introduce a per-method `retry` parameter. However, I think that complicates stuff more than using the timeout in the end.
for event_listener in self.event_listeners:
    new_listener = EventListener(
        event_listener.event_name,
        event_listener.filter_creation_function(from_block=from_block),
Is the `from_block` inclusive?
I believe the value of `self._blocknumber` from `RaidenService` is the current block, which is being polled for events.
Note: This actually depends on the order in which the callbacks are installed with the alarm task; if the polling is executed first, then `self._blocknumber` is the previous block.
As of this, it is inclusive; not sure about Parity though.
Btw, could you add a test for this under the assumptions?
    log.info('Client reconnected')
    return result

except (requests.exceptions.ConnectionError, InvalidReplyError):
Why is the client considered offline when an `InvalidReplyError` is raised?
I'm wondering what the content of the response is; perhaps the node restarted and Raiden polled for a filter that is gone?
Just after the disconnect, `tinyrpc` gets an empty response, which leads to the `InvalidReplyError`; in my testing this only happens in that case.
Not sure why this happens.
Ok, next version. I removed the decorator from
It would be nice to merge this now and fix some of the issues you brought up later. These are:
I think these are important but can be handled after this PR. Opinions?
Fixes #707.
This uses the same approach as #767 but a different timeout strategy. I discussed with Konrad and we think it doesn't make sense to kill the raiden instance. So now raiden tries to reconnect every 3 seconds in the first minute, then every 10s.
~~One issue I found is that one cannot `Ctrl-C` stop raiden while it's trying to reconnect. Not sure atm why the events aren't propagated.~~ This is now fixed by adding timeouts to the shutdown sequence.