Messages are lost when raiden node shuts down and restarts #3046

Closed
kelsos opened this Issue Nov 16, 2018 · 5 comments

@kelsos
Contributor

kelsos commented Nov 16, 2018

I have been trying to replicate the Nonce did not change sequentially issue that causes the nodes to get out of sync, effectively making transfers impossible through the specific channel path.

You can find the complete log and database of one of my tests here.

My test setup uses the production environment on a private Parity dev chain, using the transport01 Matrix transport and the limit contracts.

The scenario consists of a chain of 5 nodes: A <-> B <-> C <-> D <-> E.

The nodes have the following addresses:

  • A: 0x82641569b2062B545431cF6D7F0A418582865ba7
  • B: 0x504300C525CbE91Adb3FE0944Fe1f56f5162C75C
  • C: 0xc52952Ebad56f2c5E5b42bb881481Ae27D036475
  • D: 0x60FFe4daDE358246C49C4A16bB5b430d333B5Ce9
  • E: 0x53A9462Be18D8f74C1065Be65A58D5A41347e0A6
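
As a rough illustration of the setup (not the script I actually used; the API ports, the /api/v1/channels endpoint and the payload fields are assumptions and may differ between Raiden versions), the chain of channels could be opened through each node's REST API along these lines:

import requests

TOKEN = "0x28104ee15e1c70c421150865c3fb731c426929e6"  # test token address taken from the logs below

# (node API URL, partner to open a channel with)
CHANNELS = [
    ("http://localhost:5001", "0x504300C525CbE91Adb3FE0944Fe1f56f5162C75C"),  # A -> B
    ("http://localhost:5002", "0xc52952Ebad56f2c5E5b42bb881481Ae27D036475"),  # B -> C
    ("http://localhost:5003", "0x60FFe4daDE358246C49C4A16bB5b430d333B5Ce9"),  # C -> D
    ("http://localhost:5004", "0x53A9462Be18D8f74C1065Be65A58D5A41347e0A6"),  # D -> E
]

for api_url, partner in CHANNELS:
    response = requests.put(
        f"{api_url}/api/v1/channels",
        json={
            "partner_address": partner,
            "token_address": TOKEN,
            "total_deposit": 1000,
            "settle_timeout": 500,
        },
    )
    response.raise_for_status()  # stop early if a channel could not be opened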

After each node opens a channel with the next node in the chain, a large number of transfers starts from A to E, and while the transfers are ongoing, D gets restarted. Usually after a couple of restarts, D starts logging the Nonce did not change sequentially message, and the nonce is usually off by one.

{
	"node": "60ffe4da",
	"raiden_event": "{\"payment_identifier\": 1542369287, \"reason\": \"Invalid RefundTransfer message. Nonce did not change sequentially, expected: 195 got: 196.\", \"_type\": \"raiden.transfer.events.EventInvalidReceivedTransferRefund\", \"_version\": 0}",
	"event": "Raiden event",
	"logger": "raiden.raiden_service",
	"level": "debug",
	"timestamp": "2018-11-16 11:54:48"
}

If you go to the logs of node C and search for the missing message, you will find the corresponding send:

{
	"current_user": "@0xc52952ebad56f2c5e5b42bb881481ae27d036475:transport01.raiden.network",
	"node": "c52952eb",
	"receiver": "60ffe4da",
	"messages": ["{\"type\": \"LockedTransfer\", \"chain_id\": 17, \"message_identifier\": 6633228396832840312, \"payment_identifier\": 1542369104, \"nonce\": 195, \"token_network_address\": \"0x15008736b8edf6eab458d18440876abd65e57837\", \"token\": \"0x28104ee15e1c70c421150865c3fb731c426929e6\", \"channel_identifier\": 4, \"transferred_amount\": 485, \"locked_amount\": 5, \"recipient\": \"0x60ffe4dade358246c49c4a16bb5b430d333b5ce9\", \"locksroot\": \"0xd179122a0c66f54033173ddd743dabc0639f06363dca1a26f313b3e02f66366e\", \"lock\": {\"type\": \"Lock\", \"amount\": 5, \"expiration\": 1050, \"secrethash\": \"0xfbd409d20845ecc80304527070ffaa1125716dd1268b3ee797e79686b5ce4f1e\"}, \"target\": \"0x53a9462be18d8f74c1065be65a58d5a41347e0a6\", \"initiator\": \"0x82641569b2062b545431cf6d7f0a418582865ba7\", \"fee\": 0, \"signature\": \"0x3fae7eb66da0f5ccefeb1259e0662b4b269422e84fd3a208848a6c8a2108acc27901119b2e7e736e45790a5db7944ef9cff2ee1e0cc343e098fbaf510cd9b2d11c\", \"_type\": \"raiden.messages.LockedTransfer\", \"_version\": 0}"],
	"event": "Send",
	"logger": "raiden.network.transport.matrix",
	"level": "debug",
	"timestamp": "2018-11-16 11:51:45"
}

The message was sent at the exact moment that the receiver was shutting down:

{
	"current_user": "@0x60ffe4dade358246c49c4a16bb5b430d333b5ce9:transport01.raiden.network",
	"node": "60ffe4da",
	"config": {
		"available_servers": ["https://raidentransport.exchangeunion.com", "https://persephone.raidentransport.digitalvirtues.com", "https://raidentransport.ki-decentralized.com", "https://raidentransport.mycryptoapi.com"],
		"discovery_room": "discovery",
		"retries_before_backoff": 5,
		"retry_interval": 5.0,
		"server": "https://transport01.raiden.network",
		"private_rooms": true
	},
	"event": "Matrix stopped",
	"logger": "raiden.network.transport.matrix",
	"level": "debug",
	"timestamp": "2018-11-16 11:51:45"
}

{
	"node": "60ffe4da",
	"event": "Alarm task stopped",
	"logger": "raiden.tasks",
	"level": "debug",
	"timestamp": "2018-11-16 11:51:45"
}

If I search for the message_identifier of the missing message in the receiver's logs, the message is nowhere to be found.
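
For reference, this is the kind of search I mean — a minimal sketch that scans the JSON-lines debug logs (one object per line, as in the snippets above) for the message_identifier; the log file paths are passed on the command line:

import json
import sys

MESSAGE_ID = "6633228396832840312"  # message_identifier of the missing LockedTransfer

for path in sys.argv[1:]:
    with open(path) as log_file:
        for line_number, line in enumerate(log_file, 1):
            if MESSAGE_ID not in line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            print(f"{path}:{line_number}: {entry.get('event')} ({entry.get('timestamp')})")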

Another thing I found interesting is that the sender was also receiving some messages with an invalid nonce:

{
	"node": "c52952eb",
	"raiden_event": "{\"payment_identifier\": 1542369104, \"reason\": \"Invalid LockedTransfer message. Nonce did not change sequentially, expected: 190 got: 165.\", \"_type\": \"raiden.transfer.events.EventInvalidReceivedLockedTransfer\", \"_version\": 0}",
	"event": "Raiden event",
	"logger": "raiden.raiden_service",
	"level": "debug",
	"timestamp": "2018-11-16 11:54:30"
}

The message keeps getting sent from 60ffe4da to c52952eb on every restart of 60ffe4da.

From my understanding, the message with nonce 165 was originally sent by 60ffe4da moments before the node was shut down, and a Processed message was actually sent from c52952eb to 60ffe4da at 2018-11-16 11:51:45, so this was the same moment the node was shutting down.

I am not really sure what is going on, but from my perspective it seems that some messages get lost while the node is shutting down.
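
To illustrate the effect with a simplified toy model (this is not Raiden's actual code): once a single message is dropped, the receiver's expected nonce stays one behind, and every later message on that channel is rejected as non-sequential.

# Toy model of the sequential-nonce check; not Raiden's implementation.
class Receiver:
    def __init__(self, last_nonce):
        self.last_nonce = last_nonce  # last successfully processed nonce

    def handle(self, nonce):
        expected = self.last_nonce + 1
        if nonce != expected:
            print(f"Nonce did not change sequentially, expected: {expected} got: {nonce}")
            return  # message rejected, channel state unchanged
        self.last_nonce = nonce

receiver = Receiver(last_nonce=194)
for nonce in (196, 197, 198):  # nonce 195 was lost while the node was shutting down
    receiver.handle(nonce)
# The receiver never advances past 194 and keeps rejecting everything that follows.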

Some extra information about the environment:

{
	"raiden": "0.16.1.dev58+ged16f6e8",
	"python_implementation": "CPython",
	"python_version": "3.7.1",
	"system": "Linux 64bit_ 4.19.1-arch1-1-ARCH x86_64",
	"distribution": "source",
	"client": "Parity-Ethereum//v2.0.9-stable-09f77575d6-20181028/x86_64-linux-gnu/rustc1.30.0",
}

In one of the tests with a previous version, the message causing the nonce error was delivered out of order after a restart, 30 minutes later. I am not sure if these logs will help (I have already sent them to @andrevmatos), but just in case, you can find them here.

It should be channel 3, nonce 222, msgid 8139983256301530616.

@hackaugusto

Collaborator

hackaugusto commented Nov 16, 2018

@kelsos did the channel ever get stuck? Out-of-order messages are actually expected to happen.

In the past this was the symptom of a bug, but it's not a bug in itself. If the messages are processed after some time, then everything is fine; what we may have is not a bug but an optimization opportunity (as in, messages are not retried fast enough, and we get too many out-of-order messages).

@kelsos

Contributor

kelsos commented Nov 16, 2018

In the out-of-order case the channel was stuck for a while, yes. I had a few occurrences of channels getting stuck and getting unstuck after a few restarts, but this was not always the case. After missing a message during the restart, the node would usually refuse any message sent to it with the 'Nonce did not change sequentially' error, starting usually one off and increasing for as long as you tried to send anything through.

@LefterisJP LefterisJP added this to the Red Eyes Testnet 18 milestone Nov 26, 2018

@rakanalh rakanalh self-assigned this Nov 26, 2018

@rakanalh

Contributor

rakanalh commented Nov 26, 2018

Some questions:

My test setup uses the production environment on a private Parity dev chain, using the transport01 Matrix transport and the limit contracts.

Just to confirm... this is a private chain we're talking about, correct?

After each node opens a channel with the next node in the chain, a large number of transfers starts from A to E

How many transfers are we talking about here? Are they sent sequentially or in parallel?

while the transfers are ongoing, D gets restarted

What causes the restart? Is it a manual restart, or are you running a scenario through the scenario player?

@kelsos

Contributor

kelsos commented Nov 26, 2018

Just to confirm... this is a private chain we're talking about, correct?

Yes, this is a private chain setup.

How many transfers are we talking about here? Are they sent sequentially or in parallel?

It doesn't really matter, but I was usually sending 1000 mediated transfers, and they were sent sequentially. The script waited for a success from the API before sending the next one.
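
Roughly along these lines — this is a sketch rather than the actual script; the payments endpoint, the API port and the default amount are assumptions:

import requests

INITIATOR_API = "http://localhost:5001/api/v1"  # node A
TOKEN = "0x28104ee15e1c70c421150865c3fb731c426929e6"
TARGET = "0x53A9462Be18D8f74C1065Be65A58D5A41347e0A6"  # node E

for i in range(1000):
    # Send one mediated transfer and wait for the API response before the next one.
    response = requests.post(
        f"{INITIATOR_API}/payments/{TOKEN}/{TARGET}",
        json={"amount": 1},
    )
    if response.status_code != 200:
        # e.g. the route is stuck after a mediator restart
        print(f"transfer {i} failed: {response.status_code} {response.text}")
        break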

What causes the restart? Is it a manual restart, or are you running a scenario through the scenario player?

The restart was manual. Unfortunately I can't run this specific scenario with the scenario runner at the moment, because when the route gets stuck the /transfer API call on the initiator will block indefinitely and there is currently no way for the scenario to proceed.

@rakanalh

Contributor

rakanalh commented Nov 27, 2018

Alright, I was able to reproduce the scenario.

Setup phase:

  1. Run a private chain (in my case and @kelsos's it is a private Parity chain).
  2. Create 5 accounts.
  3. Use the scripts and follow the steps described here:
    https://gist.github.com/kelsos/604a9ecad3b94f59420fb5901c580e38#file-start_private_chain-py-L20
  4. You have to manually deploy the contracts:
    a. Deploy contracts
    b. Deploy custom test token
    c. Register token in the TokenRegistry
    All the above are described in the raiden-contracts repo.

Test Phase

  1. Run 5 Raiden nodes.
  2. Change nodes.yml from the scripts provided by @kelsos to point to the initiator and target addresses.
  3. Run run_transfers.py and, while it is sending transfers, shut down one of the mediators.
  4. Stop the run_transfers.py process once it times out (Ctrl-C).
  5. Restart the mediator which was previously shut down.
  6. Repeat steps 3-4.
  7. Run run_transfers.py while the mediator is down.
  8. Repeat step 5.

The mediator which was shut down should have a log with the error described above.
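
To confirm, the mediator's JSON log can be filtered for the invalid-nonce events, e.g. with a small sketch like this (the log file name is a placeholder):

import json

# Filter the restarted mediator's JSON-lines log for invalid-nonce events.
with open("mediator.log") as log_file:  # placeholder path
    for line in log_file:
        if "Nonce did not change sequentially" not in line:
            continue
        entry = json.loads(line)
        print(entry.get("timestamp"), entry.get("raiden_event"))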

Will investigate further.
