bug/regression: relayed messages reach recently started peer with a big delay (~60 seconds) #2388

Closed
fbarbu15 opened this issue Feb 1, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@fbarbu15
Contributor

fbarbu15 commented Feb 1, 2024

Problem

This is probably only noticeable in automated tests, where we start the node before each test and stop it at the end.
But the impact is big: the suite execution duration nearly tripled.
[image]
This might also uncover a bigger issue.

To reproduce

  1. Start 2 relay nodes connected to the same topic
  2. Immediately after they start, publish a message from node1 (with POST /relay/v1/messages)
  3. Check that the relayed messages reach both nodes (with GET /relay/v1/messages)
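
A minimal sketch of how a test could measure the delay seen in step 3. The names `seconds_until_visible` and `fetch_messages` are hypothetical and not part of any Waku tooling. The actual HTTP call to GET /relay/v1/messages is left to the caller, since the exact path parameters and response shape are not shown in this thread.

```python
import time
from typing import Callable, List, Optional


def seconds_until_visible(
    fetch_messages: Callable[[], List[dict]],
    timeout_s: float = 90.0,
    poll_interval_s: float = 0.5,
) -> Optional[float]:
    """Poll fetch_messages until it returns a non-empty list.

    fetch_messages is a caller-supplied function (an assumption of this
    sketch) that queries a node, e.g. via GET /relay/v1/messages, and
    returns the messages it currently holds.

    Returns the elapsed time in seconds, or None if nothing arrived
    before the timeout.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if fetch_messages():
            return time.monotonic() - start
        time.sleep(poll_interval_s)
    return None
```

Calling this once per node right after publishing would give one delay per node, which is the number being compared across versions in this report.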

Actual behavior

Node1 sees the messages immediately, but it takes about 60 seconds until node2 can see the messages published by node1.

nwaku version/commit hash

This can be seen with harbor.status.im/wakuorg/nwaku:latest, but it doesn't reproduce with harbor.status.im/wakuorg/nwaku:v0.24.0 (where it takes around 15 seconds) or with harbor.status.im/wakuorg/go-waku:latest (where it takes just 1 second).
Added docker logs for these versions for comparison purposes:

go-waku_latest_1_second.zip
nwaku_latest_60_seconds.zip
nwaku_v0.24.0_12_seconds.zip

Docker start flags

  • node1: ['--listen-address=0.0.0.0', '--rest=true', '--rest-admin=true', '--websocket-support=true', '--log-level=TRACE', '--rest-relay-cache-capacity=100', '--websocket-port=14657', '--rest-port=14655', '--tcp-port=14656', '--discv5-udp-port=14658', '--rest-address=0.0.0.0', '--nat=extip:172.18.94.230', '--peer-exchange=true', '--discv5-discovery=true', '--cluster-id=0', '--metrics-server=true', '--metrics-server-address=0.0.0.0', '--metrics-server-port=14659', '--metrics-logging=true', '--relay=true', '--nodekey=30348dd51465150e04a5d9d932c72864c8967f806cce60b5d26afeca1e77eb68']
  • node2: ['--listen-address=0.0.0.0', '--rest=true', '--rest-admin=true', '--websocket-support=true', '--log-level=TRACE', '--rest-relay-cache-capacity=100', '--websocket-port=16322', '--rest-port=16320', '--tcp-port=16321', '--discv5-udp-port=16323', '--rest-address=0.0.0.0', '--nat=extip:172.18.194.133', '--peer-exchange=true', '--discv5-discovery=true', '--cluster-id=0', '--metrics-server=true', '--metrics-server-address=0.0.0.0', '--metrics-server-port=16324', '--metrics-logging=true', '--relay=true', '--discv5-bootstrap-node=enr:-Kq4QNtbZhpyehoDXnigU0Si_Hr1g-dVvNx-AnQ-UvdoygcbBJxRhNIwLldG_8g2cOEQrpXdc_fwJh_HEyXbOgcVlKkBgmlkgnY0gmlwhKwSXuaKbXVsdGlhZGRyc4wACgSsEl7mBjlB3QOJc2VjcDI1NmsxoQM3Tqpf5eFn4Jztm4gB0Y0JVSJyxyZsW8QR-QU5DZb-PYN0Y3CCOUCDdWRwgjlChXdha3UyAQ']
fbarbu15 added the bug label Feb 1, 2024
@gabrielmer
Contributor

I really suspect it's related to this #2332 (comment)

@fbarbu15
Contributor Author

I really suspect it's related to this #2332 (comment)

Yes, the timing is right: it started to reproduce the day that PR was merged. Thanks @gabrielmer

@gabrielmer
Contributor

@SionoiS what do you think we should do?

@SionoiS
Contributor

SionoiS commented Feb 14, 2024

The issue is that the connectivity check interval was changed from 15s to 60s.

After adding a peer to the manager, you have to wait for the next check (up to 60s away) before it connects to this new peer.

To speed up the tests, add the peers to the peer store before starting the node, or force the connection manually instead of waiting on the peer manager. That would make the tests even faster than with the previous 15s wait.
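
To illustrate the timing, here is a simplified model rather than the nwaku implementation. It assumes the connectivity check fires at fixed multiples of the interval counted from node start. The function name `wait_for_next_check` is hypothetical.

```python
def wait_for_next_check(added_at_s: float, interval_s: float) -> float:
    """Seconds a newly added peer waits until the next periodic check.

    Assumption of this sketch: checks fire at interval_s, 2 * interval_s,
    ... seconds after node start, and a peer added exactly on a tick has
    missed that tick.
    """
    if added_at_s < 0 or interval_s <= 0:
        raise ValueError("added_at_s must be >= 0 and interval_s must be > 0")
    elapsed_in_cycle = added_at_s % interval_s
    return interval_s - elapsed_in_cycle


# A peer added 1 second after start waits almost a full interval.
print(wait_for_next_check(1.0, 60.0))  # 59.0 with the new 60s interval
print(wait_for_next_check(1.0, 15.0))  # 14.0 with the previous 15s interval
```

Under this model, a test that adds a peer right after startup always pays close to the full interval, which matches the ~60 second and ~15 second delays reported above.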

@fbarbu15
Contributor Author

@SionoiS thanks for the explanation. How do we force the connection manually?

@SionoiS
Contributor

SionoiS commented Feb 14, 2024

Call either

proc connectToRelayPeers*(pm: PeerManager) {.async.} =

Or

proc manageRelayPeers*(pm: PeerManager) {.async.} =

for the new shard-aware peer management.

@fbarbu15
Contributor Author

The tests run nwaku as a docker container, but I could use this endpoint instead: https://waku-org.github.io/waku-rest-api/#post-/admin/v1/peers
Would that give the same result?

@SionoiS
Contributor

SionoiS commented Feb 14, 2024

The tests run nwaku as a docker container, but I could use this endpoint instead: https://waku-org.github.io/waku-rest-api/#post-/admin/v1/peers
Would that give the same result?

Ah sorry, I misunderstood the context. In that case, yes, that endpoint would connect to the peer directly, bypassing the peer management.
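
A rough sketch of what that call could look like from a test harness, using only the Python standard library. It assumes the endpoint accepts a JSON array of multiaddress strings as its body, so check the linked REST API reference before relying on it. The helper name `build_add_peers_request` is hypothetical, and `<node1-peer-id>` is a placeholder because the peer ID does not appear in this thread. The IP and ports are taken from the docker start flags above.

```python
import json
import urllib.request
from typing import List


def build_add_peers_request(
    rest_base_url: str, multiaddrs: List[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST /admin/v1/peers request.

    Assumption: the request body is a JSON array of multiaddr strings.
    """
    body = json.dumps(multiaddrs).encode("utf-8")
    return urllib.request.Request(
        url=f"{rest_base_url.rstrip('/')}/admin/v1/peers",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# node2's REST port (16320) and node1's IP and TCP port come from the
# flags in this issue; the peer ID is a placeholder to fill in.
node1_addr = "/ip4/172.18.94.230/tcp/14656/p2p/<node1-peer-id>"
request = build_add_peers_request("http://127.0.0.1:16320", [node1_addr])
```

Sending it with `urllib.request.urlopen(request, timeout=5)` right after both containers are up should trigger the direct connection, so the test does not wait for the periodic check.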

@fbarbu15
Contributor Author

Great, thanks for the workaround. Closing the issue.
