Matrix improvements #1869

Merged
merged 8 commits into raiden-network:master from andrevmatos:matrix on Jul 23, 2018

Conversation

@andrevmatos
Collaborator

andrevmatos commented Jul 18, 2018

Some big changes, internal to the Matrix transport, which should make room creation, selection, and reuse much more reliable:

  • Matrix no longer relies on the room name to find or validate a room
  • Communication rooms are chosen by creation or invitation events only, and may be anonymous rooms
  • Room names are optional, used only for better information/debugging
  • When trying to contact a peer, we check whether they invited us to some room and use it; otherwise we try to create a room and invite them. If there's a name conflict on the server, an anonymous room is created with them instead (this prevents a DoS by someone pre-creating the rooms with the pair names). See the sketch after this list.
  • These address->room_id mappings are stored on the Matrix server for persistence (done in raiden-network/raiden-libs#107), but they're not essential: if we can't contact the peer, we can just create another room and invite them to it
  • Prevents a lot of race conditions on room name creation/invitation, by updating the room we use to talk to an address upon receiving a validated message from it in another room
  • All these rooms are still listened on (in case the user roams to another server), but we filter and handle only messages from users we have start_health_checked
  • Requires/asserts start_health_check on peers to be able to send a message to them. The only place this wasn't done before was between the node receiving a mediated transfer and its sender, which was fixed in this PR.
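
A rough sketch of the selection flow described above. Helper and parameter names here (`_get_room_id_for_address`, `_set_room_id_for_address`, `_make_room_alias`, `peer_user_id`) are assumptions for illustration, not necessarily the PR's exact code:

```python
from matrix_client.errors import MatrixRequestError

def _get_room_for_peer(self, peer_address, peer_user_id):
    """Pick (or create) the room used to talk to peer_address."""
    # 1. Reuse a known room: one persisted in account_data, or one the
    #    peer already invited us to.
    room_id = self._get_room_id_for_address(peer_address)
    if room_id:
        return self._client.rooms[room_id]

    # 2. Otherwise try to create a named room (the name is informational
    #    only) and invite the peer into it.
    try:
        room = self._client.create_room(
            self._make_room_alias(peer_address),  # optional, for debugging
            invitees=[peer_user_id],
        )
    except MatrixRequestError:
        # 3. Alias conflict (e.g. pre-created by an attacker squatting the
        #    pair name): fall back to an anonymous room instead.
        room = self._client.create_room(invitees=[peer_user_id])

    # 4. Persist address -> room_id on the server for future sessions.
    self._set_room_id_for_address(peer_address, room.room_id)
    return room
```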

Fix #1843

@LefterisJP

Collaborator

LefterisJP commented Jul 18, 2018

Fix the failing tests and we can merge this.

@andrevmatos

Collaborator

andrevmatos commented Jul 18, 2018

I'll just run a small test with a private branch in raiden-libs, to test on Travis (the issues are timeout related and don't occur locally). If this works, I'll revert the private-branch changes, and it should fix the tests.

@andrevmatos andrevmatos force-pushed the andrevmatos:matrix branch 9 times, most recently from 752aded to 8b366b9 Jul 18, 2018

@andrevmatos andrevmatos force-pushed the andrevmatos:matrix branch 5 times, most recently from d1cc667 to e126827 Jul 19, 2018

@andrevmatos andrevmatos force-pushed the andrevmatos:matrix branch 3 times, most recently from cf70041 to 3664aaa Jul 22, 2018

event_unlock_topic = encode_hex(event_abi_to_log_topic(event_unlock_abi))
participant1_topic = encode_hex(self.participant1.rjust(32, b'\0'))
participant2_topic = encode_hex(self.participant2.rjust(32, b'\0'))

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

This rjust is done because encode_single, when encoding an address to bytes32, pads it with zeroes to the right, while event topic encoding pads with zeroes to the left. Does anyone know of an upstream utils function for that?
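
A minimal illustration of the two padding conventions, using plain bytes operations (the address value is made up):

```python
address = bytes.fromhex('aa' * 20)  # a 20-byte address, made-up value

# Topic encoding treats the address like a number: left-padded to 32 bytes.
as_topic = address.rjust(32, b'\0')

# encode_single('bytes32', ...) treats it as bytes32: right-padded instead.
as_bytes32 = address.ljust(32, b'\0')

assert as_topic != as_bytes32
assert as_topic.endswith(address)
assert as_bytes32.startswith(address)
```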

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

Also, construct_event_topic_set was returning a list with a single sublist containing the event's ordered arguments, which in practice just matched the first topic (the event's log topic, i.e. the hash of its signature) against all parameters. That was useless and, specifically, if the list included a None (our case), it matched everything. We may want to look into that function and open an issue or PR on web3's repo to fix it.

-    encode_hex(encode_single('bytes32', self.channel_identifier)),
+    channel_topics = [
+        None,  # event topic is any
+        encode_hex(encode_single('bytes32', self.channel_identifier)),  # channel_id

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

This also produced a catch-all filter: the topics were something like [[None, <channel_id>]], which matched the event's log topic against None or channel_id, where None matches anything. Instead, we need [None, <channel_id>] or [[None], [<channel_id>]], which matches the event's log topic against anything, and the first argument against the desired channel_id.
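
To restate the filter semantics with a small illustration (`channel_id_topic` is a placeholder for the encoded channel identifier):

```python
channel_id_topic = '0x' + 'ab' * 32  # placeholder value

# In eth_getLogs topic filters, position i constrains the log's topics[i]:
# None matches anything, and a sublist means "any of these values" (OR).

# Broken form: a single position matching (None OR channel_id); the None
# makes it match every log.
catch_all = [[None, channel_id_topic]]

# Intended form: any event signature in topic 0, and topics[1] must equal
# the channel identifier. [[None], [channel_id_topic]] is equivalent.
correct = [None, channel_id_topic]
```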

@hackaugusto

hackaugusto Aug 6, 2018

Collaborator

huh, nice catch

@ulope

Looks good. Some comments inline though.

self._discovery_room_alias_full = (
-    f'#{self._discovery_room_alias}:{discovery_cfg["server"]}'
+    f'#{self._discovery_room_alias}:' +
+    config["discovery_room"]["server"]

@ulope

ulope Jul 23, 2018

Collaborator

Style nitpick: Can't this be inline in the f-string w/o the + joining?
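
The inline form being suggested would look something like:

```python
self._discovery_room_alias_full = (
    f'#{self._discovery_room_alias}:{config["discovery_room"]["server"]}'
)
```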

with self._health_semaphore:
candidates = [
self._get_user(user)
for user in self._client.search_user_directory(node_address_hex)

@ulope

ulope Jul 23, 2018

Collaborator

client.search_user_directory() already returns User instances. The _get_user() call therefore seems unnecessary.

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

Inside the client, User isn't cached in any way; it returns a new instance every time. As I use caching for _validate_userid_signature inside the User object, I want to cache it. That's why I implemented this _get_user method: it both gets a User for a user_id and caches any instance it's given (e.g. from rooms or search).
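
A sketch of that caching behavior; the cache dict name `_userid_to_user` is an assumption:

```python
def _get_user(self, user):
    """Return a cached User for a user_id, caching any instance given."""
    user_id = getattr(user, 'user_id', user)
    if user_id not in self._userid_to_user:
        # Cache the instance we were handed (e.g. from a room or a user
        # search), or fetch a fresh one if we only got a user_id string.
        if isinstance(user, str):
            user = self._client.get_user(user_id)
        self._userid_to_user[user_id] = user
    return self._userid_to_user[user_id]
```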

# invalid user displayName signature
return
old_room = self._get_room_id_for_address(peer_address)
if old_room != room.room_id:

@ulope

ulope Jul 23, 2018

Collaborator

This will always be true for the initial message since None will be unequal to any room id ;)

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

Exactly, that's the idea =)
So, if no room_id was set before, it'll always be set and saved, for future responses (if needed).

@ulope

ulope Jul 23, 2018

Collaborator

Yeah, but that also means we always get the log message, which is confusing in case there was no room in the first place.

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

Ok, I'll improve the log message to be clearer in both cases.
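
A possible shape for the clearer logging (message texts assumed):

```python
if old_room is None:
    self.log.debug(
        'Saving room for peer',
        room_id=room.room_id,
        peer=peer_address,
    )
elif old_room != room.room_id:
    self.log.debug(
        'Updating room for peer',
        old_room_id=old_room,
        room_id=room.room_id,
        peer=peer_address,
    )
```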

@@ -541,16 +553,19 @@ def _send_queued_messages(
self,
queueids_to_queues: Dict[Tuple[Address, str], List[Event]],
):
-    def send_queue(address, events):
+    def send_queue(address, queue_name, events):

@ulope

ulope Jul 23, 2018

Collaborator

The queue_name parameter is unused in _send_async; it's just there to be signature compatible with the UDP transport.

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

I know, but as it's there anyway, I thought it'd be nice to keep the interface.
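
That is, the nested helper keeps the UDP transport's callable shape (a sketch; only the signature is the point):

```python
def send_queue(address, queue_name, events):
    # queue_name is unused by the Matrix transport, but keeping it makes
    # this callable signature-compatible with the UDP transport.
    for event in events:
        self._send_async(address, event)
```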

@ulope

ulope Jul 23, 2018

Collaborator

Fair enough

if not peers and not allow_missing_peers:
self.log.error('No valid peer found', peer_address=address)
return
# no room with expected name => create one and invite peer

@ulope

ulope Jul 23, 2018

Collaborator

Is this comment still accurate?

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

Yes, because although room name isn't enforced, it's still the first try.

@ulope

ulope Jul 23, 2018

Collaborator

Ok

@@ -702,6 +734,8 @@ def _handle_presence_change(self, event):
if address not in self._address_to_userids:
return
self._address_to_userids[address].add(user_id)
# _maybe_invite_user is used here to also possibly invite users from discovery presence changes
self._maybe_invite_user(user)

@ulope

ulope Jul 23, 2018

Collaborator

Shouldn't we only invite if the state is not offline?

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

Part of this PR is exactly about being able to handle invites sent while the node is offline (which is pretty good for the reliability of the Matrix transport).

@@ -743,15 +775,7 @@ def _update_address_presence(self, address):
if new_state == self._address_to_presence.get(address):
return
self.log.debug(

@ulope

ulope Jul 23, 2018

Collaborator

IMO that log is rather useful.

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

It's mostly covered by the log above ("Changing user presence state"), so I removed it because it was spamming the logs with mostly the same info.

@ulope

ulope Jul 23, 2018

Collaborator

But this is the actual important one, since the address presence is what determines the network reachability. So I'd say remove the above one and keep this.

token_address,
) is True
unsaturated_connection_managers = connection_managers[:]
with gevent.Timeout(

@ulope

ulope Jul 23, 2018

Collaborator

This changes the timeout from the previous block number based one to a time based one. How does this affect test reliability?

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

The tests shouldn't be affected by it, as we expect the transactions to go through sooner rather than later. The main change is that now the error isn't hidden below 3 layers of utils that do the same thing, and it gets pretty-printed, so we can tell what went wrong more easily.
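
For reference, the time-based wait looks roughly like this (timeout value illustrative). `gevent.Timeout` raises inside the blocked code when the deadline passes, so the failure surfaces directly in the test with a readable traceback:

```python
import gevent

with gevent.Timeout(
    120,  # seconds, illustrative value
    exception=AssertionError('connection managers remained unsaturated'),
):
    # the test body removes managers from this list as they saturate
    while unsaturated_connection_managers:
        gevent.sleep(1)
```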

@ulope

ulope Jul 23, 2018

Collaborator

Ok, let's see what happens; if this test starts to act up we may need to revisit it.

@@ -53,6 +53,7 @@ def test_direct_transfer_to_offline_node(raiden_network, token_addresses, deposi
# Wait until the initialization of the node is complete and then stop it
gevent.wait([app1.raiden.start_event])
app1.raiden.stop()
gevent.sleep(2)

@ulope

ulope Jul 23, 2018

Collaborator

The _logout_timeout in the Matrix transport is 0s by default. So this 2s sleep may not be long enough. Could we wait for an event instead of using a sleep?

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

The logout is mostly synchronous, so it should already have happened here. This is just to allow a little time before sending the message, to be sure the event is triggered while the node is offline. It's not a hard requirement, just a safer wait that was added while debugging and that I thought would be nice to keep.
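
If an event-based wait were preferred, it could look like the following; `stop_event` is hypothetical, mirroring the `start_event` the test already waits on:

```python
# Hypothetical: block until the node signals it has fully stopped,
# instead of sleeping an arbitrary 2 seconds.
gevent.wait([app1.raiden.stop_event], timeout=10)
```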

@@ -64,6 +65,7 @@ def test_direct_transfer_to_offline_node(raiden_network, token_addresses, deposi
identifier=payment_identifier,
)
gevent.sleep(2)

@ulope

ulope Jul 23, 2018

Collaborator

Same as above. Arbitrary sleep values make tests unreliable.

@andrevmatos

andrevmatos Jul 23, 2018

Collaborator

Again, everything here should be synchronous; this is just to be a little more sure about the state (e.g. here it should ensure the invite from the online node was sent while the other one was still offline, so it would pick up this event only when it came back online).

@ulope ulope dismissed their stale review Jul 23, 2018

Code changed

@andrevmatos andrevmatos force-pushed the andrevmatos:matrix branch 3 times, most recently from af3aee3 to a0c5cac Jul 23, 2018

@ulope

ulope approved these changes Jul 23, 2018

andrevmatos added some commits Jul 17, 2018

Remove room alias requirements, now rooms are fully invite based
They're stored on the Matrix server, through `account_data`
Sign queued messages
Fix #1891 [ci integration]
Rework PaymentChannel filters
Fix #1910 [ci integration]

@andrevmatos andrevmatos force-pushed the andrevmatos:matrix branch from a0c5cac to e28fc42 Jul 23, 2018

@andrevmatos andrevmatos merged commit 1f354ba into raiden-network:master Jul 23, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
license/cla Contributor License Agreement is signed.
@hackaugusto

Submitting it now because the PR was merged with the WIP tag. Please don't do this.

@@ -103,7 +103,7 @@ def handle_channel_new(raiden, event, current_block_number):
# the channel state was queried
raiden.blockchain_events.add_payment_channel_listener(
channel_proxy,
-            from_block=data['blockNumber'],
+            from_block=data['blockNumber'] + 1,

@hackaugusto

hackaugusto Aug 6, 2018

Collaborator

This looks wrong, there is nothing forbidding two transactions from being mined at the same block, e.g. channel open + channel deposit.

@andrevmatos

andrevmatos Aug 7, 2018

Collaborator

@hackaugusto As you can see here for the deposit, all channel operations require the channel_identifier, which is not easily calculated off-chain. So they can only be performed once the ChannelOpened event is caught and the channel_identifier is known, which means only on the next block. Besides that, as we now have StatelessFilters, leaving it at the same block number always caused the ChannelOpened event to be caught and processed twice: once for the actual event which informed us about the channel, and once through its filter, which caught everything that had the channel_identifier as second parameter, including the already-caught ChannelOpened event again.

@andrevmatos

andrevmatos Aug 7, 2018

Collaborator

Another option would be to explicitly list all event topics we want to catch on PaymentChannel.all_events_filters, leaving ChannelOpened out, but IMO that would only make it easier to miss a newly added event, as per the argument above.
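
That alternative would look roughly like this (the ABI list and variable names are illustrative):

```python
# Explicitly enumerate the event topics we want, leaving ChannelOpened out:
wanted_topics = [
    encode_hex(event_abi_to_log_topic(event_abi))
    for event_abi in channel_event_abis
    if event_abi['name'] != 'ChannelOpened'
]
# OR over the allowed signatures in topic 0, exact match on topic 1:
topics = [wanted_topics, channel_id_topic]
```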

@hackaugusto

hackaugusto Aug 7, 2018

Collaborator

As you can see here for the deposit, all channel operations require the channel_identifier, which is not easily calculated off-chain, and is required for all channel operations.

True, that makes it unlikely to have two transactions at the same block.

one for the actual event which informed us about the channel,

I don't follow, what are the two sources for the same event?

@andrevmatos

andrevmatos Aug 7, 2018

Collaborator
  1. The TokenNetwork events filter (the right one, which informs us of opened channels), and, after 1. was handled:
  2. The PaymentChannel.all_events_filter channel filter, which is supposed to catch all events with this channel_identifier. But if it starts at the same blockNumber the channel was opened in, it'll also match again the ChannelOpened event that created it.
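
A timeline sketch of that duplicate delivery (block numbers illustrative):

```python
# block N: ChannelOpened(channel_identifier=C) is mined.
#
# 1. The TokenNetwork filter (from_block <= N) delivers ChannelOpened;
#    its handler installs PaymentChannel.all_events_filter for C.
# 2. The new filter matches every log whose second topic is C:
#      from_block=N      -> re-delivers the ChannelOpened that created it
#      from_block=N + 1  -> skips it, while later events for C still match
```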