
Matrix improvements #1869

Merged
merged 8 commits into from Jul 23, 2018

Conversation

andrevmatos
Contributor

@andrevmatos andrevmatos commented Jul 18, 2018

Some big changes, internal to the Matrix transport, which should make room creation, selection and reuse much more reliable:

  • Matrix no longer relies on the room name to find or validate a room
  • Communication rooms are chosen by creation or invitation events only, and may be anonymous rooms
  • Room names are optional and used only for easier inspection/debugging
  • When trying to contact a client, we first check whether we were invited to a room by them and use it; otherwise we try to create a room and invite them. If there's a name conflict on the server, an anonymous room is created with them instead (this prevents a DoS by someone pre-creating the rooms with the pair names). A rough sketch of this flow follows below.
  • These address->room_id mappings are stored on the matrix server for persistence (done in Matrix improvements raiden-libs#107), but they are not essential: if we can't contact the client, we can just create another room and invite them to it
  • Prevents a lot of race conditions around room name creation/invitation by updating the room we use to talk to an address upon receiving a validated message from them in another room
  • All these rooms are still listened on (in case the user roams to another server), but we filter and handle only messages from users we have start_health_check'ed
  • Requires/asserts start_health_check for peers in order to be able to send messages to them. The only place this wasn't done before was between the node receiving a mediated transfer and its sender, which is fixed in this PR.

Fix #1843
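For illustration only, a minimal sketch of that room-selection flow. Only _get_room_id_for_address appears in the actual diff; _set_room_id_for_address, _was_invited_by and _peer_user_ids are hypothetical placeholders, not the PR's implementation:

def _get_room_for_address(self, peer_address):
    # 1. reuse a previously stored address -> room_id mapping, if any
    room_id = self._get_room_id_for_address(peer_address)
    if room_id is not None:
        return self._client.rooms[room_id]

    # 2. otherwise, use a room the peer already invited us to
    for room in self._client.rooms.values():
        if self._was_invited_by(room, peer_address):  # hypothetical check
            self._set_room_id_for_address(peer_address, room.room_id)
            return room

    # 3. otherwise create a room (anonymous on name conflict) and invite the peer
    room = self._client.create_room(invitees=self._peer_user_ids(peer_address))
    self._set_room_id_for_address(peer_address, room.room_id)
    return room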

@andrevmatos andrevmatos requested a review from ulope July 18, 2018 10:23
@andrevmatos andrevmatos added the Type / Enhancement, dev: Please Review, Type / Optimization and Component / Transport labels Jul 18, 2018
ulope
ulope previously approved these changes Jul 18, 2018
Collaborator

@ulope ulope left a comment

Very nice!

@LefterisJP
Contributor

Fix the failing tests and we can merge this.

@andrevmatos
Contributor Author

andrevmatos commented Jul 18, 2018

I'll just make a small test with a private branch in raiden-libs to test on Travis (the issues are timeout related and don't occur locally). If this works, I'll revert the private-branch changes, and it should fix the tests.


event_unlock_topic = encode_hex(event_abi_to_log_topic(event_unlock_abi))
participant1_topic = encode_hex(self.participant1.rjust(32, b'\0'))
participant2_topic = encode_hex(self.participant2.rjust(32, b'\0'))
Contributor Author

This rjust is done because encode_single, when encoding an address to bytes32, pads it with zeroes to the right, while event topic encoding is padded with zeroes to the left. Does anyone know of an upstream utils function to do that?
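For illustration, a tiny sketch of the padding difference (plain Python, illustrative address value, not code from this PR):

participant = bytes.fromhex('00' * 18 + 'beef')   # a 20-byte address (illustrative)

# indexed address arguments appear in log topics left-padded to 32 bytes
topic = '0x' + participant.rjust(32, b'\0').hex()

# encoding the same raw bytes as bytes32 would instead pad on the right,
# hence the explicit rjust above
bytes32_style = participant.ljust(32, b'\0')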

Contributor Author

Also, construct_event_topic_set was returning a list with a single sublist containing the ordered arguments of the event, which practically just matched the first topic (the event's log topic, i.e. the hash of the event signature) against all parameters. That was useless and, specifically, if it included a None (our case), it matched everything. We may want to look into that function and open an issue or PR on web3's repo to fix it.

encode_hex(encode_single('bytes32', self.channel_identifier)),
channel_topics = [
None, # event topic is any
encode_hex(encode_single('bytes32', self.channel_identifier)), # channel_id
Contributor Author

This also produced a catch-all filter: topics were something like [[None, <channel_id>]], which essentially matched the event's log topic against None or channel_id, where None matches anything. Instead, we need [None, <channel_id>] or [[None], [<channel_id>]], which matches the event's log topic against anything and the first argument against the desired channel_id.
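For illustration, a minimal sketch of the two shapes, assuming eth_getLogs-style topic filters and a 32-byte channel_identifier (names are illustrative, not the PR's code):

from eth_abi import encode_single
from eth_utils import encode_hex

channel_id_topic = encode_hex(encode_single('bytes32', channel_identifier))

# catch-all (wrong): a single sublist means "topic0 is None OR channel_id",
# and None matches anything, so every log passes the filter
broken_topics = [[None, channel_id_topic]]

# intended: topic0 (the event signature hash) may be anything,
# topic1 (the first indexed argument) must equal channel_id
channel_topics = [None, channel_id_topic]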

Contributor

huh, nice catch

Collaborator

@ulope ulope left a comment

Looks good. Some comments inline though.

self._discovery_room_alias_full = (
f'#{self._discovery_room_alias}:{discovery_cfg["server"]}'
f'#{self._discovery_room_alias}:' +
config["discovery_room"]["server"]
Collaborator

Style nitpick: can't this be inlined in the f-string without the + joining?
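For reference, the inlined version might look roughly like this (sketch only, not the final code):

self._discovery_room_alias_full = (
    f'#{self._discovery_room_alias}:{config["discovery_room"]["server"]}'
)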

with self._health_semaphore:
candidates = [
self._get_user(user)
for user in self._client.search_user_directory(node_address_hex)
Collaborator

client.search_user_directory() already returns User instances. The _get_user() call therefore seems unnecessary.

Contributor Author

Inside the client, User isn't cached in any way; it returns a new instance every time. As I use caching for _validate_userid_signature inside the User object, I want to cache it. That's why I implemented this _get_user method: it both gets a User for a user_id and caches any given instance (e.g. from rooms or search).
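Roughly, such a caching helper could look like this. This is a sketch under the assumption of a _userid_to_user dict on the transport; it is not necessarily the PR's exact code:

from typing import Union
from matrix_client.user import User

def _get_user(self, user: Union[User, str]) -> User:
    """Return a cached User for a user_id, caching any User instance passed in."""
    if isinstance(user, User):
        # cache instances handed to us from rooms or directory search results
        self._userid_to_user[user.user_id] = user
        return user
    if user not in self._userid_to_user:
        self._userid_to_user[user] = self._client.get_user(user)
    return self._userid_to_user[user]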

# invalid user displayName signature
return
old_room = self._get_room_id_for_address(peer_address)
if old_room != room.room_id:
Collaborator

This will always be true for the initial message since None will be unequal to any room id ;)

Contributor Author

Exactly, that's the idea =)
So, if no room_id was set before, it'll always be set and saved for future responses (if needed).

Collaborator

Yeah, but that also means we always get the log message, which is confusing if there was no room in the first place.

Contributor Author

Ok, I'll improve the log message to be clearer in both cases.
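One possible shape for the clearer messages (a sketch only; _set_room_id_for_address is assumed here and the final wording in the PR may differ):

old_room = self._get_room_id_for_address(peer_address)
if old_room != room.room_id:
    if old_room is None:
        self.log.debug('Saving room for peer', peer_address=peer_address, room=room)
    else:
        self.log.debug(
            'Updating room for peer',
            peer_address=peer_address,
            old_room=old_room,
            room=room,
        )
    self._set_room_id_for_address(peer_address, room.room_id)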

@@ -541,16 +553,19 @@ def _send_queued_messages(
self,
queueids_to_queues: Dict[Tuple[Address, str], List[Event]],
):
def send_queue(address, events):
def send_queue(address, queue_name, events):
Collaborator

The queue_name parameter is unused in _send_async; it's just there to be signature-compatible with the UDP transport.

Contributor Author

I know, but as it's there anyway, I thought it'd be nice to keep the interface.

Collaborator

Fair enough

if not peers and not allow_missing_peers:
self.log.error('No valid peer found', peer_address=address)
return
# no room with expected name => create one and invite peer
Collaborator

Is this comment still accurate?

Contributor Author

Yes, because although the room name isn't enforced, it's still the first try.

Collaborator

Ok

@@ -702,6 +734,8 @@ def _handle_presence_change(self, event):
if address not in self._address_to_userids:
return
self._address_to_userids[address].add(user_id)
# maybe inviting user used to also possibly invite user's from discovery presence changes
self._maybe_invite_user(user)
Collaborator

Shouldn't we only invite if the state is not offline?

Contributor Author

Part of this PR is exactly about being able to handle invites sent while the node is offline (which is pretty good for the reliability of the matrix transport).

@@ -743,15 +775,7 @@ def _update_address_presence(self, address):

if new_state == self._address_to_presence.get(address):
return
self.log.debug(
Collaborator

IMO that log is rather useful.

Contributor Author

It's mostly covered by the log above ('Changing user presence state'), so I removed it because it was spamming the logs with mostly the same info.

Collaborator

But this is the actually important one, since the address presence is what determines network reachability. So I'd say remove the one above and keep this.

token_address,
) is True
unsaturated_connection_managers = connection_managers[:]
with gevent.Timeout(
Collaborator

This changes the timeout from the previous block number based one to a time based one. How does this affect test reliability?

Contributor Author

The tests shouldn't be affected by it, as we expect the transactions to go through sooner rather than later. The main change is that the error is no longer hidden below three layers of utils that do the same thing, and it gets pretty-printed, so we can tell what went wrong more easily.
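For context, the time-based pattern boils down to something like this (a sketch; the saturation check and the exception text are placeholders):

import gevent

# raises after `timeout` seconds with a readable error, instead of burying the
# failure inside nested block-polling helpers
with gevent.Timeout(timeout, exception=AssertionError('connection managers not saturated')):
    while unsaturated_connection_managers:
        gevent.sleep(1)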

Collaborator

Ok, let's see what happens; if this test starts to act up we may need to revisit it.

@@ -53,6 +53,7 @@ def test_direct_transfer_to_offline_node(raiden_network, token_addresses, deposi
# Wait until the initialization of the node is complete and then stop it
gevent.wait([app1.raiden.start_event])
app1.raiden.stop()
gevent.sleep(2)
Collaborator

The _logout_timeout in the Matrix transport is 0s by default, so this 2s sleep may not be long enough. Could we wait for an event instead of using a sleep?

Contributor Author

The logout is mostly synchronous, so it should already have happened here. This just leaves a little time before sending the message, to make sure the event is triggered while the node is offline. It's not a hard requirement, just a safety wait added while debugging that I thought would be nice to keep.
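As a generic illustration of the "wait for an event" alternative suggested above (the event here is hypothetical; the node would have to expose or set it on logout):

from gevent.event import Event

stopped = Event()          # hypothetical: set by the transport once logout completes
# ... app1.raiden.stop() would arrange for stopped.set() ...
stopped.wait(timeout=10)   # block until logout finished, instead of a fixed sleep
assert stopped.is_set(), 'node did not stop within 10s'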

@@ -64,6 +65,7 @@ def test_direct_transfer_to_offline_node(raiden_network, token_addresses, deposi
identifier=payment_identifier,
)

gevent.sleep(2)
Collaborator

Same as above. Arbitrary sleep values make tests unreliable.

Contributor Author

Again, everything here should be synchronous; this is just to be a little more sure about the state (e.g. here it should ensure the invite from the online node was sent while the other one was still offline, so that node only picks up this event once it comes back online).

Contributor

@hackaugusto hackaugusto left a comment

Submitting it now because the PR was merged with the WIP tag still on it; please don't do this.

@@ -103,7 +103,7 @@ def handle_channel_new(raiden, event, current_block_number):
# the channel state was queried
raiden.blockchain_events.add_payment_channel_listener(
channel_proxy,
from_block=data['blockNumber'],
from_block=data['blockNumber'] + 1,
Contributor

This looks wrong; there is nothing forbidding two transactions from being mined in the same block, e.g. channel open + channel deposit.

Contributor Author

@hackaugusto As you can see here for the deposit, all channel operations require the channel_identifier, which is not easily calculated off-chain. So they can only be performed once the ChannelOpened event is caught and the channel_identifier is known, which means only from the next block on. Besides that, now that we have StatelessFilters, leaving it at the same block number always caused the ChannelOpened event to be caught and processed twice: once from the actual event which informed us about the channel, and once from its own filter, which caught everything that had the channel_identifier as second parameter, including the already handled ChannelOpened event.
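To illustrate the overlap with a web3-style filter (variable names are assumed, and web3 stands for a Web3 instance; this is not the actual Raiden code):

# The TokenNetwork filter already delivered ChannelOpened at block `opened_block`.
# A per-channel filter that matches any event whose first indexed argument equals
# the channel_identifier would re-match that same log unless it starts one block later.
channel_filter = web3.eth.filter({
    'address': token_network_address,
    'topics': [None, channel_id_topic],  # any event signature, topic1 == channel_id
    'fromBlock': opened_block + 1,       # skip the block containing ChannelOpened
})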

Contributor Author

Another option would be to explicitly list all the event topics we want to catch in PaymentChannel.all_events_filters, leaving ChannelOpened out, but IMO that would only make it easier to miss a newly added event, as per the argument above.

Contributor

As you can see here for the deposit, all channel operations require the channel_identifier, which is not easily calculated off-chain, and is required for all channel operations.

True, that makes it unlikely to have two transactions at the same block.

one for the actual event which informed us about the channel,

I don't follow, what are the two sources for the same event?

Contributor Author

  1. The TokenNetwork events filter (the right one, which informs us of opened channels), and, after 1. was handled:
  2. the PaymentChannel.all_events_filter channel filter, which is supposed to catch all events with this channel_identifier; but if it starts at the same blockNumber the channel was opened in, it'll also match again the very ChannelOpened event that created it.


Successfully merging this pull request may close these issues:

Raiden crashes for trying to create an existing "room"