Skip to content
This repository has been archived by the owner on Aug 1, 2022. It is now read-only.

Proxy deadlock results in CI failing #2133

Closed
geigerzaehler opened this issue Jul 15, 2021 · 2 comments · Fixed by #2163
Closed

Proxy deadlock results in CI failing #2133

geigerzaehler opened this issue Jul 15, 2021 · 2 comments · Fixed by #2163

Comments

@geigerzaehler
Copy link
Contributor

Since the radicle-link upgrade (#2028) our CI has started failing, both on Github and Buildkite.

It seems that this is caused by radicle-proxy going not binding to the local API anymore. This causes the tests to error with “Failed to fetch”. This error always occurs after the proxy is reset and a new keystore is created. The last log lines from the proxy are

2021-07-13T11:18:53.074 WARN  api::http::control::handler                 > reload requested
2021-07-13T11:18:53.074 INFO  proxy::http                                 > "GET /v1/control/reset HTTP/1.1" 200 585.566µs
2021-07-13T11:18:53.078 INFO  api::process                                > starting API
2021-07-13T11:18:53.083 INFO  proxy::http                                 > "OPTIONS /v1/keystore HTTP/1.1" 200 82.036µs
2021-07-13T11:18:53.098 INFO  proxy::http                                 > "POST /v1/keystore HTTP/1.1" 204 9.948806ms

After this all subsequent tests fail with “Failed to fetch”.

@geigerzaehler geigerzaehler changed the title CI failing after proxy goes away Proxy deadlock results in CI failing Jul 20, 2021
@geigerzaehler
Copy link
Contributor Author

After some more investigation it looks like the proxy is deadlocking. This only seems to happen if the proxy is reloaded multiple times. Reloading the proxy means that the radicle_daemon peer is torn down and reinitialized. The deadlock may be caused by the peer not being torn down properly and not all tasks being aborted.

geigerzaehler added a commit to radicle-dev/radicle-link that referenced this issue Jul 20, 2021
We ensure that the cleanup function returned by `protocol::accept` from
`librad` is called when the `daemon::Peer::run()` future is dropped.
This seems to fix deadlocks that occur when the daemon is torn down
multiple times. See radicle-dev/radicle-upstream#2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
@rudolfs
Copy link
Member

rudolfs commented Jul 20, 2021

This seems to happen also when running Upstream in dev mode via yarn start. It takes about 30min for the hang to occour. Here's what I see in the logs when it hangs:

...
 2021-07-20T14:50:58.533 DEBUG librad::net::peer::storage                  > ask; want=Payload { urn: Urn { id: Oid(6473a027a10e24148872cda22d67bf831ed32adf), path: Some(RefLike("rad/id")) }, rev: Some(Git(e0ad2de9be92f2f5ea12c71c465307d6f5bed906)), origin: None }
 2021-07-20T14:50:58.533 DEBUG librad::net::protocol::tick                 > tock; tock=SendConnected { to: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc, message: Gossip(Want { origin: GenericPeerInfo { peer_id: hyy4ebh8amk3ypyawgpjnfpbgkxqwdo5n1iif5s17yghwmewrf4d8a, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [95.111.216.99:12345], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(6473a027a10e24148872cda22d67bf831ed32adf), path: Some(RefLike("rad/id")) }, rev: Some(Git(e0ad2de9be92f2f5ea12c71c465307d6f5bed906)), origin: None } }) }
 2021-07-20T14:50:58.533 INFO  librad::net::protocol::io::send::rpc        > send_rpc; remote_id=hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc remote_addr=34.141.175.140:12345
 2021-07-20T14:50:58.534 DEBUG librad::net::protocol::tick                 > tock
 2021-07-20T14:50:58.576 INFO  librad::net::protocol::io::streams          > new ingress stream
 2021-07-20T14:50:58.577 INFO  librad::net::protocol::broadcast            > apply; remote_id=hybz9gfgtd9d4pd14a6r66j5hz6f77fed4jdu7pana4fxaxbt369kg message=Want { origin: GenericPeerInfo { peer_id: hyy4ebh8amk3ypyawgpjnfpbgkxqwdo5n1iif5s17yghwmewrf4d8a, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [95.111.216.99:12345], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(6473a027a10e24148872cda22d67bf831ed32adf), path: Some(RefLike("rad/id")) }, rev: Some(Git(e0ad2de9be92f2f5ea12c71c465307d6f5bed906)), origin: None } }
 2021-07-20T14:50:58.577 DEBUG librad::net::peer::storage                  > ask; want=Payload { urn: Urn { id: Oid(6473a027a10e24148872cda22d67bf831ed32adf), path: Some(RefLike("rad/id")) }, rev: Some(Git(e0ad2de9be92f2f5ea12c71c465307d6f5bed906)), origin: None }
 2021-07-20T14:50:58.577 DEBUG librad::net::protocol::tick                 > tock; tock=SendConnected { to: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc, message: Gossip(Want { origin: GenericPeerInfo { peer_id: hyy4ebh8amk3ypyawgpjnfpbgkxqwdo5n1iif5s17yghwmewrf4d8a, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [95.111.216.99:12345], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(6473a027a10e24148872cda22d67bf831ed32adf), path: Some(RefLike("rad/id")) }, rev: Some(Git(e0ad2de9be92f2f5ea12c71c465307d6f5bed906)), origin: None } }) }
 2021-07-20T14:50:58.577 INFO  librad::net::protocol::io::send::rpc        > send_rpc; remote_id=hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc remote_addr=34.141.175.140:12345
 2021-07-20T14:50:58.578 DEBUG librad::net::protocol::tick                 > tock
 2021-07-20T14:51:57.099 DEBUG librad::net::quic::connection::tracking     > GC conn=140594076971024 prev_epoch=0 curr_epoch=1 tracked_epoch=1
 2021-07-20T14:51:57.099 DEBUG librad::net::quic::connection::tracking     > GC conn=140594154417168 prev_epoch=0 curr_epoch=1 tracked_epoch=1
 2021-07-20T14:53:02.106 DEBUG librad::net::quic::connection::tracking     > GC conn=140594076971024 prev_epoch=1 curr_epoch=2 tracked_epoch=1
 2021-07-20T14:53:02.106 DEBUG librad::net::quic::connection::tracking     > GC conn=140594154417168 prev_epoch=1 curr_epoch=2 tracked_epoch=1
 2021-07-20T14:53:02.106 INFO  librad::net::quic::stream                   > on_stream_error; dir="recv" remote_id=hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc remote_addr=34.141.175.140:12345
 2021-07-20T14:53:02.107 WARN  librad::net::quic::connection               > stream error err=Custom { kind: NotConnected, error: ConnectionClosed(LocallyClosed) }

geigerzaehler added a commit to radicle-dev/radicle-link that referenced this issue Jul 20, 2021
We ensure that the cleanup function returned by `protocol::accept` from
`librad` is called when the `daemon::Peer::run()` future is dropped.
This seems to fix deadlocks that occur when the daemon is torn down
multiple times. See radicle-dev/radicle-upstream#2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
geigerzaehler added a commit to radicle-dev/radicle-link that referenced this issue Jul 21, 2021
`Peer::run` now properly cleans up the network stack when run to
completion. To this end it accepts a shutdown signal as an argument.
This seems to fix deadlocks that occur when the daemon is torn down
multiple times [1].

Instead of copying the approach from `seed/src/lib.rs` we use a shutdown
future and port this approach to the seed. The reason for this is that
the seed also did not handle shutdown properly. If a termination signal
was sent it did not break the main loop. Moreover we did not react to
shutdown requests while awaiting `peer.bind()`.

[1]: radicle-dev/radicle-upstream#2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
geigerzaehler added a commit that referenced this issue Jul 21, 2021
Update radicle-link to include radicle-dev/radicle-link#731.
Fixes #2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
geigerzaehler added a commit that referenced this issue Jul 21, 2021
Update radicle-link to include radicle-dev/radicle-link#731.
Fixes #2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
geigerzaehler added a commit that referenced this issue Jul 21, 2021
Update radicle-link to include radicle-dev/radicle-link#731.
Fixes #2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
geigerzaehler added a commit to radicle-dev/radicle-link that referenced this issue Jul 27, 2021
`Peer::run` now properly cleans up the network stack when run to
completion. To this end it accepts a shutdown signal as an argument.
This seems to fix deadlocks that occur when the daemon is torn down
multiple times [1].

Instead of copying the approach from `seed/src/lib.rs` we use a shutdown
future and port this approach to the seed. The reason for this is that
the seed also did not handle shutdown properly. If a termination signal
was sent it did not break the main loop. Moreover we did not react to
shutdown requests while awaiting `peer.bind()`.

[1]: radicle-dev/radicle-upstream#2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
geigerzaehler added a commit that referenced this issue Jul 27, 2021
Update radicle-link to include radicle-dev/radicle-link#731.
Fixes #2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
geigerzaehler added a commit that referenced this issue Jul 27, 2021
Update radicle-link to include radicle-dev/radicle-link#731.
Fixes #2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
geigerzaehler added a commit that referenced this issue Jul 27, 2021
Update radicle-link to include radicle-dev/radicle-link#731.
Fixes #2133

Signed-off-by: Thomas Scholtes <geigerzaehler@axiom.fm>
@rudolfs rudolfs closed this as completed in 06bd982 Aug 2, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants