Client crashes after channels settle #2932

Closed
kelsos opened this Issue Oct 30, 2018 · 9 comments

@kelsos
Contributor

kelsos commented Oct 30, 2018

I was testing with the raiden-v0.15.1.dev6+g14cba904-linux binary on Kovan. At some point I decided to leave all the networks and settle all my channels. I had all my nodes online and waited for all the channels to settle.

As soon as all the channels settled and started disappearing from the WebUI, one of the nodes crashed. After that, the node crashed on every startup with a different message that seemed similar to the one described by Cosmin in #2931.

Since I couldn't start the client for that account, I removed the database for this specific account. (All the channels were settled before it crashed, but if you think this might somehow be a cause of the issue I will give it another try.)

After this I started all my nodes again and began opening channels in an attempt to get the channels stuck. I somehow got one of the channels out of sync for mediated transfers, but I wanted to try the scenario once more to make sure this was not a coincidence. So I started leaving the network on all of my nodes and waited for all my channels to settle.

As soon as all of my channels settled, 2 out of the 5 running clients crashed again.

After this, these two nodes would crash each time I attempted to start them. You can find the original crash message and the restart crash message for both nodes in the logs here and the databases here.

@kelsos


Contributor

kelsos commented Nov 5, 2018

I managed to reproduce it again on 8a305fe, so it seems that the issue still exists on master.

@kelsos


Contributor

kelsos commented Nov 5, 2018

I will try to see if I can create a scenario for the scenario runner that replicates the issue.

@LefterisJP


Collaborator

LefterisJP commented Nov 5, 2018

> I managed to reproduce it again on 8a305fe so it seems that the issue still exists on master.

Yes, we did not address this issue; it definitely still exists. We need to find a reproducible way to test it and then fix it.

@LefterisJP


Collaborator

LefterisJP commented Nov 12, 2018

Full logs given to me by @kelsos

0x60FFe4daDE358246C49C4A16bB5b430d333B5Ce9: https://gist.github.com/LefterisJP/ad86d221e4cacb122c62a19cc703fa81

0xc52952ebad56f2c5e5b42bb881481ae27d036475: https://gist.github.com/LefterisJP/a2611a177d674c1e4185b9e2ed02c29a

@kelsos


Contributor

kelsos commented Nov 12, 2018

Here are the logs of one of the nodes that crashed during today's tests.

https://gist.github.com/kelsos/00ea38174c6dfc8885df8793b6756700

@LefterisJP


Collaborator

LefterisJP commented Nov 12, 2018

So it seems that what happens is this: once a BatchUnlock is received for a valid channel, it is subdispatched down to the channel state machine, and since we were a participant the channel state ends; as a result we delete the channel from the channel maps.

Once that is done, the problem can manifest as a crash in a few different parts of the code, because we still iterate transfers_pair and never delete the affected pairs from it:

  1. Whenever we receive a new block and handle it in the first function, there is an assertion.
  2. Whenever we receive a new block, events_for_onchain_secretreveal_if_dangerzone() iterates transfers_pair and asserts inside get_payer_channel().
    This is where Kelsos' node crashed. What I don't understand is why it did not already crash at (1).
  3. Kelsos also reported a crash above where clear_if_finalized raises a KeyError for the same reason (channel identifier not in the channel map) here.

I suppose the solution is to delete the transfer pair once a channel has been unlocked and completely removed from the channel map.
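
For illustration, here is a minimal sketch of that cleanup, assuming the mediator state keeps a transfers_pair list and each pair knows its payer and payee channel identifiers. The class and function names below are made up for this example and are not the actual Raiden code; the real fix would hook such a step into the handler for the on-chain BatchUnlock, right after the channel is dropped from the channel map.

```python
from dataclasses import dataclass, field
from typing import List


# Illustrative stand-ins for the real state objects; the actual Raiden
# classes carry much more data, this sketch only keeps what it needs.
@dataclass
class TransferPair:
    payer_channel_id: int
    payee_channel_id: int


@dataclass
class MediatorState:
    transfers_pair: List[TransferPair] = field(default_factory=list)


def prune_pairs_for_removed_channel(state: MediatorState, removed_channel_id: int) -> None:
    """Drop every transfer pair that still references the channel that was
    just deleted from the channel map, so later new-block handlers (e.g. the
    dangerzone secret-reveal check) no longer look up a missing channel and
    hit the assertion."""
    state.transfers_pair = [
        pair
        for pair in state.transfers_pair
        if removed_channel_id not in (pair.payer_channel_id, pair.payee_channel_id)
    ]


# Example: the pair referencing channel 55 disappears once channel 55 is
# unlocked and removed; the unrelated pair stays.
state = MediatorState(transfers_pair=[TransferPair(55, 60), TransferPair(61, 62)])
prune_pairs_for_removed_channel(state, removed_channel_id=55)
assert state.transfers_pair == [TransferPair(61, 62)]
```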

@kelsos


Contributor

kelsos commented Nov 12, 2018

It seems that I accidentally pasted the 0x60FFe4 logs for 0xc52952 too.

If you check the full logs for 0xc52952, you can see that the assertion error is logged just before the node stopped because of the crash:

2018-10-30 15:34:04 [error    ] Runnable subtask died!         [raiden.utils.runnable] exc=AssertionError("Couldn't find channel for channel_id: 55",) running=True subtask=<Greenlet "AlarmTask|Greenlet-2" at 0x7fdc3c1cd648: <bound method AlarmTask._run of <raiden.tasks.AlarmTask object at 0x7fdc3b2b13c8>>> this=<RaidenService c52952eb>

In all of my tests there was usually one node crashing with the assertion error and another node crashing with a different error. Today too, two of the three nodes that crashed did so with the assertion error and one with the crash I posted above.

LefterisJP added a commit to LefterisJP/raiden that referenced this issue Nov 13, 2018

@LefterisJP


Collaborator

LefterisJP commented Nov 13, 2018

This is also a critical bug: once it occurs, the node is stuck in a restart/crash loop and the client can no longer be used.

@czepluch


Collaborator

czepluch commented Nov 13, 2018

I also have a node that's stuck in this loop. Let me know if you need any information from it, but the logs look like those of Kelsos.

LefterisJP added a commit to LefterisJP/raiden that referenced this issue Nov 14, 2018

rakanalh added a commit to rakanalh/raiden that referenced this issue Nov 16, 2018
