Matrix / Synapse presence is flaky #5059

Closed
rakanalh opened this issue Oct 9, 2019 · 4 comments


@rakanalh rakanalh commented Oct 9, 2019

During testing, we have had multiple instances of the same failure in both the client and the PFS while making a transfer:

  • The client could fail to mediate a transfer because it believed that a channel's partner was offline.
  • The PFS could fail to provide a route, also because it saw one of the channel participants as offline.

While running the BF1 scenario, all 5 nodes were up:

  • 0x90aafBeEbEb11E9b17Bc50E4141aeDFcfDfAD8b9
  • 0xaC6563E01faC0422A9FeC039E7A547C88a48c661
  • 0xD055752E4a98731BF17CE3336654b6777eE8a8df
  • 0x7ebE1fa14F414873Bc713261c17727ec37bdb89F
  • 0x0EeF446E80e9A2057e32c68BA4d5c2128c0b1a74

[Screenshot: DeepinScreenshot_select-area_20191009154758]

The initial transfer from 0 to 3 failed because the PFS did not provide any routes.

{"routes": [], "feedback_token": null, "event": "Received route(s) from PFS", "logger": "raiden.routing", "level": "info", "timestamp": "2019-10-09 12:39:11.476518"}
{"errors": "Payment couldn't be completed because: there is no route available", "status_code": 409, "event": "Error processing request", "logger": "raiden.api.rest", "level": "error", "timestamp": "2019-10-09 12:39:11.527605"}

When @palango investigated on the PFS' side, we found out that the PFS saw the participating nodes as offline:

2019-10-09 12:39:11.451128 [warning  ] Error while handling request   [pathfinding_service.api] details={'from_': '0x7ebE1fa14F414873Bc713261c17727ec37bdb89F', 'to': '0x90aafBeEbEb11E9b17Bc50E4141aeDFcfDfAD8b9', 'value': 1000000000000000} error=NoRouteFound(None) message=No route between nodes found.

All logs & DBs for the nodes can be found here:

run_21.zip


@Dominik1999 Dominik1999 commented Oct 22, 2019

@rakanalh @err508

I suggest the following procedure to reproduce the Matrix Bug:

  • Set up a Matrix deployment locally, including federation and presence, that provides us with logs
  • Reproduce the bug
    • Locally, Rakan's script succeeds on federated servers
    • Reproduce BF1 scenario presence behaviour with script and run locally
    • Locally BF1 succeeds on federated servers and local pfs (FAILS FOR RAKAN)
    • Rakan's script succeeds on remote federated servers
    • Remotely BF1 succeeds on federated servers [FAILS]
  • Provide bug report to Matrix
  • Debug on our own
  • Adjust integration tests to catch this bug

This is basically the current status, right?


@err508 err508 commented Oct 23, 2019

Current status of this issue:

  • the problem was not reproducible locally with a local matrix federation and a local PFS, where the scenario ran successfully

  • due to some changes in the client, the Raiden nodes now update each other's presence consistently across multiple scenario runs. However, we still see "errors": "Payment couldn't be completed because: there is no route available". This was found to be caused by the PFS not tracking the nodes' presence correctly.

  • we currently believe that this error was caused by the deletion of the transport server databases. After the transport servers were restarted, the nightly scenario player runs started multiple nodes simultaneously on different homeservers. As @rakanalh found out, this led to a misconfiguration of the remote servers: there are multiple global_rooms on the different servers, but the PFS only listens for presence updates in the one on its own homeserver. This explains why the problem was not reproducible locally, since there the rooms were created correctly for each run on fresh servers, and also why the scenario worked when all nodes and the PFS were on the same homeserver.

Planned next steps:

  • transport1 - 4 will be stopped and the database will be deleted
  • The appropriate rooms will be recreated and aliased to avoid any duplicates

@rakanalh rakanalh commented Oct 24, 2019

Path finding rooms on the remote servers:

{
    "aliases": [
        "#raiden_goerli_path_finding:transport01.raiden.network"
    ],
    "canonical_alias": "#raiden_goerli_path_finding:transport01.raiden.network",
    "guest_can_join": false,
    "num_joined_members": 357,
    "room_id": "!zPNQUseHcedZfiQKEg:transport01.raiden.network",
    "world_readable": false
}
{
    "aliases": [
        "#raiden_goerli_path_finding:transport02.raiden.network"
    ],
    "canonical_alias": "#raiden_goerli_path_finding:transport02.raiden.network",
    "guest_can_join": false,
    "num_joined_members": 57,
    "room_id": "!WqXlHrMJeoxnsHdiOQ:transport02.raiden.network",
    "world_readable": false
}
{
    "aliases": [
        "#raiden_goerli_path_finding:transport03.raiden.network"
    ],
    "canonical_alias": "#raiden_goerli_path_finding:transport03.raiden.network",
    "guest_can_join": false,
    "num_joined_members": 44,
    "room_id": "!aowlIeSnJIvsDgpBfD:transport03.raiden.network",
    "world_readable": false
}

This basically means that we have 3 distinct rooms on these servers. Whenever we run scenarios that pass --matrix-server pointing to one of the above servers, the node ends up joining the room on that server. As a result, the nodes (including the PFS) in a single scenario run end up in different rooms.
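A quick way to spot this kind of duplication from a room listing like the one above is to group rooms by the localpart of their alias. This is only a sketch: the `ROOMS` list is a trimmed copy of the JSON above, and the helper names `rooms_by_localpart` and `duplicated_rooms` are made up for illustration, not part of Raiden or Synapse.

```python
from collections import defaultdict

# Trimmed copy of the room listing above (only the fields used here).
ROOMS = [
    {
        "aliases": ["#raiden_goerli_path_finding:transport01.raiden.network"],
        "room_id": "!zPNQUseHcedZfiQKEg:transport01.raiden.network",
    },
    {
        "aliases": ["#raiden_goerli_path_finding:transport02.raiden.network"],
        "room_id": "!WqXlHrMJeoxnsHdiOQ:transport02.raiden.network",
    },
    {
        "aliases": ["#raiden_goerli_path_finding:transport03.raiden.network"],
        "room_id": "!aowlIeSnJIvsDgpBfD:transport03.raiden.network",
    },
]


def rooms_by_localpart(rooms):
    """Group room IDs by alias localpart (the part between '#' and ':')."""
    grouped = defaultdict(list)
    for room in rooms:
        for alias in room["aliases"]:
            localpart = alias[1:].split(":", 1)[0]
            grouped[localpart].append(room["room_id"])
    return dict(grouped)


def duplicated_rooms(rooms):
    """Return the localparts that map to more than one distinct room ID."""
    return {
        name: ids
        for name, ids in rooms_by_localpart(rooms).items()
        if len(set(ids)) > 1
    }


if __name__ == "__main__":
    for name, ids in duplicated_rooms(ROOMS).items():
        print(f"{name}: {len(ids)} distinct rooms")
```

With the listing above, `raiden_goerli_path_finding` is reported as backed by three distinct room IDs, one per homeserver.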

Synapse presence works in a way that our nodes receive presence updates iff the lists of rooms they have joined intersect. If there is no intersection, we are not allowed to see the other user's presence. For Raiden client nodes, this is not a problem, because the participants of a single channel join a room specific to that channel, which allows each of them to see its partner's presence. The exception is the PFS, which relies on the discovery room and the path_finding room to figure out routes. The PFS joins only one discovery room and one path_finding room, assuming that these are the only discovery and path_finding rooms on the servers. According to the list of rooms above, this is not the case.
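The shared-room rule described above can be modelled in a few lines. This is a deliberately simplified sketch of the visibility rule as stated in this comment, not Synapse's actual implementation. The user names, the `!channel_ab:example` room ID, and the function `can_see_presence` are hypothetical; the two path_finding room IDs are taken from the listing above.

```python
from typing import Dict, Set


def can_see_presence(
    observer: str, target: str, memberships: Dict[str, Set[str]]
) -> bool:
    """True iff observer and target have at least one joined room in common."""
    observer_rooms = memberships.get(observer, set())
    target_rooms = memberships.get(target, set())
    return bool(observer_rooms & target_rooms)


# Hypothetical memberships: the PFS and node_a use transport01, node_b uses
# transport02. The two nodes also share a (made-up) channel room.
MEMBERSHIPS = {
    "pfs": {"!zPNQUseHcedZfiQKEg:transport01.raiden.network"},
    "node_a": {
        "!zPNQUseHcedZfiQKEg:transport01.raiden.network",
        "!channel_ab:example",
    },
    "node_b": {
        "!WqXlHrMJeoxnsHdiOQ:transport02.raiden.network",
        "!channel_ab:example",
    },
}

if __name__ == "__main__":
    for observer, target in [("node_a", "node_b"), ("pfs", "node_a"), ("pfs", "node_b")]:
        visible = can_see_presence(observer, target, MEMBERSHIPS)
        print(f"{observer} sees {target}: {visible}")
```

In this model the two client nodes see each other through their channel room, while the PFS sees only the node that joined the same path_finding room, which matches the symptom of the PFS treating online nodes as offline.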

@rakanalh rakanalh assigned rakanalh and unassigned ulope Oct 24, 2019
@Dominik1999 Dominik1999 moved this from In progress to In review in Raiden Client Oct 25, 2019
@Dominik1999 Dominik1999 moved this from In review to In progress in Raiden Client Oct 25, 2019

@rakanalh rakanalh commented Oct 28, 2019

Closing this issue as the solution was deemed to be in: raiden-network/raiden-services#609

@rakanalh rakanalh closed this Oct 28, 2019
Raiden Client automation moved this from In progress to Done Oct 28, 2019