
Potential critical issue: LockExpired message for initiator #3528

Closed
LefterisJP opened this issue Jan 2, 2019 · 2 comments

Comments

@LefterisJP
Contributor

Problem Definition

While working on #3218 I experimented a lot and ended up getting the following stack trace:

2019-01-02 15:52:02.877585 [error    ] updateNonClosingBalanceProof transaction has already been mined and updated the channel succesfully. [raiden.raiden_service]
2019-01-02 15:52:02.879437 [warning  ] Message already in queue - ignoring [raiden.network.transport.matrix] message=<LockExpired [chainid:337 token_network_address:5169a14b msg_id:13879169527902156452 channel_identifier:1 nonce:2 transferred_amount:0 locked_amount:0 locks
root:00000000 secrethash:18ce6381]> queue=<QueueIdentifier recipient:63662b2f channel_identifier:1> receiver=63662b2f
Traceback (most recent call last):         
  File "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run
  File "/home/lefteris/.virtualenvs/raiden/lib/python3.7/site-packages/raiden_libs/network/matrix/client.py", line 353, in _handle_response
    self.call(room._put_event, event)
  File "/home/lefteris/.virtualenvs/raiden/lib/python3.7/site-packages/raiden_libs/network/matrix/client.py", line 308, in call
    return callback(*args, **kwargs)    
  File "/home/lefteris/.virtualenvs/raiden/lib/python3.7/site-packages/matrix_client/room.py", line 308, in _put_event
    listener['callback'](self, event)
  File "/home/lefteris/ew/raiden/raiden/network/transport/matrix.py", line 949, in _handle_message
    self._receive_message(message)
  File "/home/lefteris/ew/raiden/raiden/network/transport/matrix.py", line 980, in _receive_message
    self._raiden_service.on_message(message)
  File "/home/lefteris/ew/raiden/raiden/raiden_service.py", line 463, in on_message
    self.message_handler.on_message(self, message)
  File "/home/lefteris/ew/raiden/raiden/message_handler.py", line 45, in on_message
    self.handle_message_lockexpired(raiden, message)
  File "/home/lefteris/ew/raiden/raiden/message_handler.py", line 90, in handle_message_lockexpired
    raiden.handle_state_change(state_change)
  File "/home/lefteris/ew/raiden/raiden/raiden_service.py", line 472, in handle_state_change
    event_list = self.wal.log_and_dispatch(state_change)
  File "/home/lefteris/ew/raiden/raiden/storage/wal.py", line 79, in log_and_dispatch
    events = self.state_manager.dispatch(state_change)
  File "/home/lefteris/ew/raiden/raiden/transfer/architecture.py", line 261, in dispatch
    state_change,
  File "/home/lefteris/ew/raiden/raiden/transfer/node.py", line 1129, in state_transition
    iteration = handle_state_change(chain_state, state_change)
  File "/home/lefteris/ew/raiden/raiden/transfer/node.py", line 873, in handle_state_change
    state_change,
  File "/home/lefteris/ew/raiden/raiden/transfer/node.py", line 668, in handle_receive_lock_expired
    state_change.secrethash,
  File "/home/lefteris/ew/raiden/raiden/transfer/node.py", line 160, in subdispatch_to_paymenttask
    block_number,
  File "/home/lefteris/ew/raiden/raiden/transfer/mediated_transfer/initiator_manager.py", line 400, in state_transition
    block_number,
  File "/home/lefteris/ew/raiden/raiden/transfer/mediated_transfer/initiator_manager.py", line 275, in handle_lock_expired
    channel_state = channelidentifiers_to_channels[channel_identifier]
KeyError: 1
2019-01-02T15:53:23Z <Greenlet "sync_handle_response-s85_9_0_1_7_1_1_4_1" at 0x7f05f7919048: <bound method GMatrixClient._handle_response of <raiden_libs.network.matrix.client.GMatrixClient object at 0x7f05f7d239e8>>({'next_batch': 's86_9_0_1_7_1_1_4_1', 'device_one_, False)> failed with KeyError

The test involves a refund transfer. The mediator was offline for a very long time during that run and sent the LockExpired message for the refund transfer to the initiator extremely late. So late, in fact, that the channel had already been cleaned from the state: the channel_identifier was no longer in channelidentifiers_to_channels, and thus we got a KeyError.

def handle_lock_expired(
        payment_state: InitiatorPaymentState,
        state_change: ReceiveLockExpired,
        channelidentifiers_to_channels: typing.ChannelMap,
        pseudo_random_generator: random.Random,
        block_number: typing.BlockNumber,
) -> TransitionResult:
    """Initiator also needs to handle LockExpired messages when refund transfers are involved.

    A -> B -> C

    - A sends locked transfer to B
    - B attempted to forward to C but has not enough capacity
    - B sends a refund transfer with the same secrethash back to A
    - When the lock expires B will also send a LockExpired message to A
    - A needs to be able to properly process it

    Related issue: https://github.com/raiden-network/raiden/issues/3183
    """
    channel_identifier = payment_state.initiator.channel_identifier
    channel_state = channelidentifiers_to_channels[channel_identifier]
    secrethash = payment_state.initiator.transfer.lock.secrethash
    result = channel.handle_receive_lock_expired(
        channel_state=channel_state,
        state_change=state_change,
        block_number=block_number,
    )

In other cases of LockExpired we first check whether it is a valid lock expiry and, if not, do nothing (example here). But since that initiator_manager code is very new, we did not think of this case when we added it, and there is an unprotected map query which can result in a KeyError.
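The crash can be reproduced in isolation with a plain dict standing in for channelidentifiers_to_channels after the channel was cleaned from the state (the names here are illustrative stand-ins, not Raiden's actual state objects):

```python
# Minimal reproduction of the unguarded map query, outside of Raiden.
# The empty dict models the state after the channel was cleaned up.
channelidentifiers_to_channels = {}

def handle_lock_expired(channel_identifier):
    # Unguarded lookup: raises KeyError when the channel is gone.
    return channelidentifiers_to_channels[channel_identifier]

try:
    handle_lock_expired(1)
except KeyError as err:
    print(f"KeyError: {err}")  # prints "KeyError: 1", as in the trace above
```

Any state-machine handler that indexes the channel map without a membership check first is exposed to exactly this failure.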

The fix is very simple:

diff --git a/raiden/transfer/mediated_transfer/initiator_manager.py b/raiden/transfer/mediated_transfer/initiator_manager.py
index 771fb9de..f34fa583 100644
--- a/raiden/transfer/mediated_transfer/initiator_manager.py
+++ b/raiden/transfer/mediated_transfer/initiator_manager.py
@@ -2,7 +2,7 @@ import random
 
 from raiden.transfer import channel
 from raiden.transfer.architecture import Event, StateChange, TransitionResult
-from raiden.transfer.events import EventPaymentSentFailed
+from raiden.transfer.events import EventInvalidReceivedLockExpired, EventPaymentSentFailed
 from raiden.transfer.mediated_transfer import initiator
 from raiden.transfer.mediated_transfer.events import EventUnlockClaimFailed, EventUnlockFailed
 from raiden.transfer.mediated_transfer.state import (
@@ -272,6 +272,13 @@ def handle_lock_expired(
     Related issue: https://github.com/raiden-network/raiden/issues/3183
 """
     channel_identifier = payment_state.initiator.channel_identifier
+    if channel_identifier not in channelidentifiers_to_channels:
+        invalid_lock_expired = EventInvalidReceivedLockExpired(
+            secrethash=state_change.secrethash,
+            reason=f'LockExpired for non existing channel identifier {channel_identifier}',
+        )
+        return TransitionResult(payment_state, [invalid_lock_expired])
+
     channel_state = channelidentifiers_to_channels[channel_identifier]
     secrethash = payment_state.initiator.transfer.lock.secrethash
     result = channel.handle_receive_lock_expired(

We would simply ignore any such LockExpired message.

Evaluation

@LefterisJP opinion: This can rarely occur in normal operation, and never if both nodes are online. The problem is that malicious attackers can use it to their advantage if they know about it. If this becomes public, it could be used to attack people's nodes and cause them to crash. If you know someone is the initiator of any transfer that is not yet finalized (so that an InitiatorState exists), you can send them a LockExpired message containing any other channel identifier and crash them. Because the map query comes before the is_valid_lock_expired() check, all the other fields of the LockExpired message do not matter at all. So it should be fixed ASAP.

On the positive side, it's not easy to figure out that this vulnerability exists or to craft the attack, so I am not very worried. We can either:

  • "shadow-fix" at the last moment before the next client release (best case next Friday). Remember that we should also have a testing period before the release, and this fix (if we go with a "shadow-fix") should be added only after testing is done and we are about to tag the release as final. If we added it to the release candidate, people could potentially see it and exploit it between the release candidate and the actual release.

OR

  • make an immediate critical-fix patch release containing only the patch shown above, and perhaps the infura fix Loredana requested, as it's definitely harmless. This is the cleanest approach, but I would prefer to avoid critical releases unless it's something very serious, as it may look bad.

I think it's very unlikely that someone will find and exploit this at such an early stage of the network, so I am leaning towards the shadow-fix during the next release, as long as we go for such a release ASAP. But I am open to suggestions.

@rakanalh's opinion: This is a DoS attack that crashes nodes and breaks the always-online requirement, which could lead to loss of funds. I think we can classify it as critical. However, I think Lefteris' suggestion of making a shadow-fix should be the route to go with.

others: Please edit the issue to add your evaluation here or post below.

@stefante

stefante commented Jan 4, 2019

@LefterisJP @rakanalh @heikoheiko: Do we have any communicated commitments regarding fixing times for critical bugs? I.e. is issue #260 communicated, or just an internal suggestion? If it is communicated, we should do a critical bug release. If not, it is our judgement call, and then I agree with you on doing the shadow fix. Given the number of channels currently open, the DoS risk is not too high.

@LefterisJP
Contributor Author

This has not been communicated publicly at all. As per our pre-release guidelines for critical issues found internally, I described it only here in this private repo, and now we can decide what to do without any pressure.

> If not, it is our judgement and then I agree with you in doing the shadow fix. Given the number of channels currently open the DOS risk is not too high.

Great. That's exactly my line of thinking too.

@LefterisJP LefterisJP transferred this issue from another repository Feb 21, 2019