New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mediated transfer stuck due to nonce mismatch #2918

Closed
palango opened this Issue Oct 29, 2018 · 14 comments

Comments

Projects
None yet
6 participants
@palango
Collaborator

palango commented Oct 29, 2018

I tried a payment from my node (0x28898a6b762d9c400ff900eff66f2885f0b28faf) to @CosminNechifor node (0xD6e68536eCe0Eac02b39317FdCE814087Bae6118) this morning.

661 2018-10-29 08:14:58 [debug    ] Initiating payment             [raiden.api.rest] amount=100 node=28898a6b payment_identifier=1540800898221 registr
     y_address=0xB9633dd9a9a71F22C933bF121d7a22008f66B908 target_address=0xD6e68536eCe0Eac02b39317FdCE814087Bae6118 token_address=0xC02aaA39b223FE8D0A0
     e5C4F27eAD9083C756Cc2
 662 2018-10-29 08:14:58 [debug    ] initiating transfer            [raiden.api.python] amount=100 identifier=1540800898221 initiator=28898a6b target=d
     6e68536 token=c02aaa39
 663 2018-10-29 08:14:58 [debug    ] State change                   [raiden.raiden_service] node=28898a6b state_change={"transfer": {"payment_network_i
     dentifier": "0xB9633dd9a9a71F22C933bF121d7a22008f66B908", "payment_identifier": 1540800898221, "amount": 100, "token_network_identifier": "0xa5C9E
     Cf54790334B73E5DfA1ff5668eB425dC474", "initiator": "0x28898A6b762d9c400FF900EFF66F2885F0B28FAF", "target": "0xD6e68536eCe0Eac02b39317FdCE814087Bae
     6118", "secret": "0xbb3852ff87fe869d9b4c07b8d0780ba37f6c10a64d6cc70347e53470d06182d4", "_type": "raiden.transfer.mediated_transfer.state.TransferD
     escriptionWithSecretState", "_version": 0}, "routes": [{"node_address": "0x10F4BC8CD156B4785f8F1f100C53C4cA0181E9b9", "channel_identifier": 46, "_
     type": "raiden.transfer.state.RouteState", "_version": 0}, {"node_address": "0xf9ED3cCa3e56A7B72049e6e830C39fd842A2F8e6", "channel_identifier": 44
     , "_type": "raiden.transfer.state.RouteState", "_version": 0}, {"node_address": "0x06f5140A415f4E3969f7e40daEcCFd0558c9cE26", "channel_identifier"
     : 47, "_type": "raiden.transfer.state.RouteState", "_version": 0}, {"node_address": "0xe89Ce3756690Ac7937972797471d7c84F9818ac6", "channel_identif
     ier": 45, "_type": "raiden.transfer.state.RouteState", "_version": 0}], "_type": "raiden.transfer.mediated_transfer.state_change.ActionInitInitiat
     or", "_version": 0}

This took the route over node 0x10f4bc8cd156b4785f8f1f100c53c4ca0181e9b9, which is one of the hosted nodes. In the logs there we see:

54806 2018-10-29 08:14:58 [debug    ] Message data                   [raiden.network.transport.matrix] current_user=@0x10f4bc8cd156b4785f8f1f100c53c4ca
      0181e9b9:persephone.raidentransport.digitalvirtues.com data={"type": "LockedTransfer", "chain_id": 1, "message_identifier": 15665816987759021927,
       "payment_identifier": 1540800898221, "nonce": 4, "token_network_address": "0xa5c9ecf54790334b73e5dfa1ff5668eb425dc474", "token": "0xc02aaa39b223
      fe8d0a0e5c4f27ead9083c756cc2", "channel_identifier": 46, "transferred_amount": 10, "locked_amount": 1100, "recipient": "0x10f4bc8cd156b4785f8f1f1
      00c53c4ca0181e9b9", "locksroot": "0x664cf7e5bdceb8ea5fbd7edab84c9792f7c9e41465bc6731d023ad1a454ada73", "lock": {"type": "Lock", "amount": 100, "e
      xpiration": 6604309, "secrethash": "0x688362550af8ac4b047d0570a16a1a52a641690c93791cb5bf666782e8365dfd"}, "target": "0xd6e68536ece0eac02b39317fdc
      e814087bae6118", "initiator": "0x28898a6b762d9c400ff900eff66f2885f0b28faf", "fee": 0, "signature": "0x321bb4cabf4a303cde0e91444bee60b569524285f1a
      5c4e2863c2af519479e9d2c77a623f86da2015ecfc32cf28bd7e9a1ab6cd9bf8f76008a812c81bb4e01631b"} node=10f4bc8c room=<Room id="!GkWjoGUnqPFVpsmWlL:persep
      hone.raidentransport.digitalvirtues.com" aliases=[]> sender=28898a6b sender_user=<User id="@0x28898a6b762d9c400ff900eff66f2885f0b28faf:persephone
      .raidentransport.digitalvirtues.com">
54807 2018-10-29 08:14:58 [debug    ] Message received               [raiden.network.transport.matrix] current_user=@0x10f4bc8cd156b4785f8f1f100c53c4ca
      0181e9b9:persephone.raidentransport.digitalvirtues.com message=<LockedTransfer [chainid:1 msgid:15665816987759021927 paymentid:1540800898221 toke
      n_network:a5c9ecf5 channel_identifier:46 nonce:4 transferred_amount:10 locked_amount:1100 locksroot:664cf7e5 hash:6e63aed5 secrethash:68836255 ex
      piration:6604309 amount:100]> node=10f4bc8c sender=28898a6b
54808 2018-10-29 08:14:58 [debug    ] Send                           [raiden.network.transport.matrix] current_user=@0x10f4bc8cd156b4785f8f1f100c53c4ca
      0181e9b9:persephone.raidentransport.digitalvirtues.com data={"type": "Delivered", "delivered_message_identifier": 15665816987759021927, "signatur
      e": "0x817f8f002ff9c50238b5f364bf53ed031d58b39958492c20b6a977ba57b84feb42cb71285cde5decd36e74892eea9975467dd2cac3a7664a13775d23a562fbdb1b", "_typ
      e": "raiden.messages.Delivered", "_version": 0} node=10f4bc8c receiver=28898a6b room=<Room id="!GkWjoGUnqPFVpsmWlL:persephone.raidentransport.dig
      italvirtues.com" aliases=[]>
54809 2018-10-29 08:14:58 [info     ] partner for channel state isn't reachable, ignoring [raiden.routing] from_address=10f4bc8c partner_address=0e5215
      99 status=unreachable
54810 2018-10-29 08:14:58 [info     ] partner for channel state isn't reachable, ignoring [raiden.routing] from_address=10f4bc8c partner_address=4d6fa6
      0e status=unknown
54811 2018-10-29 08:14:58 [debug    ] State change                   [raiden.raiden_service] node=10f4bc8c state_change={"routes": [{"node_address": "0
      xD6e68536eCe0Eac02b39317FdCE814087Bae6118", "channel_identifier": 35, "_type": "raiden.transfer.state.RouteState", "_version": 0}], "from_route":
       {"node_address": "0x28898A6b762d9c400FF900EFF66F2885F0B28FAF", "channel_identifier": 46, "_type": "raiden.transfer.state.RouteState", "_version"
      : 0}, "from_transfer": {"message_identifier": 15665816987759021927, "payment_identifier": 1540800898221, "token": "0xC02aaA39b223FE8D0A0e5C4F27eA
      D9083C756Cc2", "balance_proof": {"nonce": 4, "transferred_amount": 10, "locked_amount": 1100, "locksroot": "0x664cf7e5bdceb8ea5fbd7edab84c9792f7c
      9e41465bc6731d023ad1a454ada73", "token_network_identifier": "0xa5C9ECf54790334B73E5DfA1ff5668eB425dC474", "channel_identifier": 46, "message_hash
      ": "0xcbb69244a98da1cf5115ef7e08213e04716f479173d19ea9b054d90723a3959a", "signature": "0x321bb4cabf4a303cde0e91444bee60b569524285f1a5c4e2863c2af5
      19479e9d2c77a623f86da2015ecfc32cf28bd7e9a1ab6cd9bf8f76008a812c81bb4e01631b", "sender": "0x28898A6b762d9c400FF900EFF66F2885F0B28FAF", "chain_id":
      1, "_type": "raiden.transfer.state.BalanceProofSignedState", "_version": 0}, "lock": {"amount": 100, "expiration": 6604309, "secrethash": "0x6883
      62550af8ac4b047d0570a16a1a52a641690c93791cb5bf666782e8365dfd", "_type": "raiden.transfer.state.HashTimeLockState", "_version": 0}, "initiator": "
      0x28898A6b762d9c400FF900EFF66F2885F0B28FAF", "target": "0xD6e68536eCe0Eac02b39317FdCE814087Bae6118", "_type": "raiden.transfer.mediated_transfer.
      state.LockedTransferSignedState", "_version": 0}, "balance_proof": {"nonce": 4, "transferred_amount": 10, "locked_amount": 1100, "locksroot": "0x
      664cf7e5bdceb8ea5fbd7edab84c9792f7c9e41465bc6731d023ad1a454ada73", "token_network_identifier": "0xa5C9ECf54790334B73E5DfA1ff5668eB425dC474", "cha
      nnel_identifier": 46, "message_hash": "0xcbb69244a98da1cf5115ef7e08213e04716f479173d19ea9b054d90723a3959a", "signature": "0x321bb4cabf4a303cde0e9
      1444bee60b569524285f1a5c4e2863c2af519479e9d2c77a623f86da2015ecfc32cf28bd7e9a1ab6cd9bf8f76008a812c81bb4e01631b", "sender": "0x28898A6b762d9c400FF9
      00EFF66F2885F0B28FAF", "chain_id": 1, "_type": "raiden.transfer.state.BalanceProofSignedState", "_version": 0}, "balance_hash": "0xee182cc9fe22ff
      ddc6238fedf1ae022bda01ab9639f0aa4182b44e6ff959664a", "_type": "raiden.transfer.mediated_transfer.state_change.ActionInitMediator", "_version": 0}
54812 2018-10-29 08:14:58 [debug    ] Raiden event                   [raiden.raiden_service] node=10f4bc8c raiden_event={"payment_identifier": 15408008
      98221, "reason": "Invalid LockedTransfer message. Nonce did not change sequentially, expected: 1 got: 4.", "_type": "raiden.transfer.events.Event
      InvalidReceivedLockedTransfer", "_version": 0}

So it seems that there is some other problem going on with the nonce update.

Both nodes are on the latest nightly build.
Initiator logs (starting at line 661):
raiden-debug_2018-10-29T07:55:38.486410.log
Mediator logs (starting at line 54806):
node04.log

@palango palango changed the title from Mediated transfer stuck due to nonce update to Mediated transfer stuck due to nonce mismatch Oct 29, 2018

@palango

This comment has been minimized.

Collaborator

palango commented Oct 29, 2018

@palango

This comment has been minimized.

Collaborator

palango commented Oct 29, 2018

The whole initiator logs from first sync.

concat.log

@kelsos

This comment has been minimized.

Contributor

kelsos commented Oct 30, 2018

It might not necessarily be the same issue as Paul mentioned, but the observed results are the same, and the mediating node is the same.

For me I just open my node for this address for the first time, and Cosmin tried to send me a mediated transfer. I never got the transfer. I also tried to send to both Cosmin through the node and to the node itself but that also failed.

You can find a small part the logs here and the full logs and database here

@palango

This comment has been minimized.

Collaborator

palango commented Oct 30, 2018

We had a look at this in the train, but no definitive conclusions yet.

What seemed strange is that before that mediated payment that got stuck, there were two successful payments using a different channel. However, it seems that the nonce fro this channel didn't start at 1, but right at 4 which might be worth investigating further.

@kelsos

This comment has been minimized.

Contributor

kelsos commented Oct 30, 2018

@palango In my case it seems like after I joined the channel I got a message with nonce 2.

@kelsos

This comment has been minimized.

Contributor

kelsos commented Nov 6, 2018

I managed to replicate a nonce mismatch, channel stuck a couple of times. I am not sure if it is the same issue but I am testing on #2941 to see if this fixes the error and the problem.

The way I managed to get a mediator stuck was by sending a transfer through a path with insufficient funds.

@kelsos kelsos closed this Nov 6, 2018

@kelsos kelsos reopened this Nov 6, 2018

@hackaugusto

This comment has been minimized.

Collaborator

hackaugusto commented Nov 6, 2018

@kelsos please rebase on top of #2941 (with the latest changes from yesterday), and run the tests again. If it happens please provide the logs (and maybe the database too).

@palango

This comment has been minimized.

Collaborator

palango commented Nov 6, 2018

From my understanding this might be the same bug as #2938

@hackaugusto

This comment has been minimized.

Collaborator

hackaugusto commented Nov 6, 2018

sender: 0x28898A6b762d9c400FF900EFF66F2885F0B28FAF
receiver: 0x10F4BC8CD156B4785f8F1f100C53C4cA0181E9b9

@palango , The sender logs node=28898a6b really starts from the nonce 3 for the channel 46:

$ sed -e '/identifier": 46/!d' -e '/Raiden event/!d' -e 's/.*raiden_event=//' raiden-debug_2018-10-29T07.55.38.486410.log | head -n1
{
  "recipient": "0x10F4BC8CD156B4785f8F1f100C53C4cA0181E9b9",
  "channel_identifier": 46,
  "message_identifier": 8241720689174629612,
  "transfer": {
    "payment_identifier": 1540800587588,
    "token": "0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2",
    "balance_proof": {
      "nonce": 3,
      "transferred_amount": 10,
      "locked_amount": 1000,
      "locksroot": "0xdf72ce62d1ec2df92619691e0ad96a0853aec3a0106ed6ea149f9ff63bc2f566",
      "token_network_identifier": "0xa5C9ECf54790334B73E5DfA1ff5668eB425dC474",
      "channel_identifier": 46,
      "chain_id": 1,
      "_type": "raiden.transfer.state.BalanceProofUnsignedState",
      "_version": 0
    },
    "lock": {
      "amount": 1000,
      "expiration": 6604292,
      "secrethash": "0xa9bc88d62a4d22d98cdc7cbf0432d2ac5a120ab264cfa826bb1e7adb74ef47d0",
      "_type": "raiden.transfer.state.HashTimeLockState",
      "_version": 0
    },
    "initiator": "0x28898A6b762d9c400FF900EFF66F2885F0B28FAF",
    "target": "0xD6e68536eCe0Eac02b39317FdCE814087Bae6118",
    "_type": "raiden.transfer.mediated_transfer.state.LockedTransferUnsignedState",
    "_version": 0
  },
  "balance_proof": {
    "nonce": 3,
    "transferred_amount": 10,
    "locked_amount": 1000,
    "locksroot": "0xdf72ce62d1ec2df92619691e0ad96a0853aec3a0106ed6ea149f9ff63bc2f566",
    "token_network_identifier": "0xa5C9ECf54790334B73E5DfA1ff5668eB425dC474",
    "channel_identifier": 46,
    "chain_id": 1,
    "_type": "raiden.transfer.state.BalanceProofUnsignedState",
    "_version": 0
  },
  "balance_hash": "0x2160bc16f4d832394a55209e9059be544bcd973618c8061b50c66d92bc4ec014",
  "_type": "raiden.transfer.mediated_transfer.events.SendLockedTransfer",
  "_version": 0
}

Edit:

The logs of the receiver are way older, we can see that just after the channel was created the first message received really started with nonce=3, so either the sender skipped the nonces 1 and 2, or the messages where never picked up by the transport layer.


Do you have the previous logs for that node? Edit: I just saw the concat.log, and the nonces really start from 3:

$ sed -e '/identifier": 46/!d' -e '/Raiden event/!d' -e 's/.*raiden_event=//' concat.log | jq '.balance_proof.nonce'
3
4
5
6
7
8

Okay, so concat.log is also old enough, we can see when the channel was created, and the first nonce was indeed 3.

My current theory is that the nonce was wrongly increase because of the two prior transfers 1 2


Edit:

After getting the database of the sender with @palango I replayed the state changes and got a different result.

@hackaugusto

This comment has been minimized.

Collaborator

hackaugusto commented Nov 7, 2018

I can not reproduce the error. What I found:

  • The sender's log (concat.log) and the database v9 agree. All state changes in the logs are also in the database, and are identical (Compared by encoding the state changes using the JSONSerializer and md5summing the state change data, ref )
  • Even though the state changes are identical, the produced events are different ref
@hackaugusto

This comment has been minimized.

Collaborator

hackaugusto commented Nov 7, 2018

I found the problem with this script.

The problem is the deserialization and the object cache, turns out that our_state from channels without any transfers are equal, on restart these objects are wrongly shared. The problem persists across restarts because the same object will be serialized and shared. The result is all the channels with a shared state get stuck.

@ulope

This comment has been minimized.

Collaborator

ulope commented Nov 7, 2018

AFAIK this is the second or third time we've had one of those hard to track down issues with the deserialize refcache.

Maybe we need to solve this in a more general way since it seems likely that we could have other objects in the future that are equal but not identical.

@LefterisJP

This comment has been minimized.

Collaborator

LefterisJP commented Nov 7, 2018

@ulope yeah we are completely removing the reference cache now. Check augusto's PR.

@palango

This comment has been minimized.

Collaborator

palango commented Nov 7, 2018

Fixed by #2968

@palango palango closed this Nov 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment