bug/regression: Relay connection works no more #2299

Closed
fbarbu15 opened this issue Dec 15, 2023 · 5 comments · Fixed by #2307

Comments

@fbarbu15
Contributor

Problem

With harbor.status.im/wakuorg/nwaku:latest, the automated relay tests no longer work.
The node does not seem to connect to peers via the relay protocol, and published messages do not reach peers.
I think this is a regression, because it used to work and it still works with wakuorg/nwaku:latest from Docker Hub.
It also works with go-waku from the Harbor registry (harbor.status.im/wakuorg/go-waku:latest).

To reproduce

  1. Starting Node1 with: ['--listen-address=0.0.0.0', '--rpc=true', '--rpc-admin=true', '--rest=true', '--rest-admin=true', '--websocket-support=true', '--log-level=TRACE', '--rest-relay-cache-capacity=100', '--websocket-port=41477', '--rpc-port=41475', '--rest-port=41474', '--tcp-port=41476', '--discv5-udp-port=41478', '--rpc-address=0.0.0.0', '--rest-address=0.0.0.0', '--nat=extip:172.18.115.44', '--peer-exchange=true', '--discv5-discovery=true', '--metrics-server=true', '--metrics-server-address=0.0.0.0', '--metrics-server-port=41479', '--metrics-logging=true', '--relay=true', '--nodekey=30348dd51465150e04a5d9d932c72864c8967f806cce60b5d26afeca1e77eb68']
  2. Starting Node2 with: ['--listen-address=0.0.0.0', '--rpc=true', '--rpc-admin=true', '--rest=true', '--rest-admin=true', '--websocket-support=true', '--log-level=TRACE', '--rest-relay-cache-capacity=100', '--websocket-port=63427', '--rpc-port=63425', '--rest-port=63424', '--tcp-port=63426', '--discv5-udp-port=63428', '--rpc-address=0.0.0.0', '--rest-address=0.0.0.0', '--nat=extip:172.18.86.95', '--peer-exchange=true', '--discv5-discovery=true', '--metrics-server=true', '--metrics-server-address=0.0.0.0', '--metrics-server-port=63429', '--metrics-logging=true', '--relay=true', '--discv5-bootstrap-node=enr:-Kq4QEldeRrsAPwEaAtoU_Cbb17SPTXMDxa8V6dlu-JiN-FBU3aPRkwO4iQe5Is0uUYB4Pip3TVjd_JYHY46SYL6gq0BgmlkgnY0gmlwhKwScyyKbXVsdGlhZGRyc4wACgSsEnMsBqIF3QOJc2VjcDI1NmsxoQM3Tqpf5eFn4Jztm4gB0Y0JVSJyxyZsW8QR-QU5DZb-PYN0Y3CCogSDdWRwgqIGhXdha3UyAQ']
  3. Subscribe both nodes to pubsub topic /waku/2/rs/18/1 using /relay/v1/subscriptions
  4. Publish messages using POST /relay/v1/messages/%2Fwaku%2F2%2Frs%2F18%2F1. Ex: {"payload": "UmVsYXkgd29ya3MhIQ==", "contentTopic": "/test/1/waku-relay/proto", "timestamp": 1702631412554725120}
  5. Query messages from the other node using GET /relay/v1/messages/%2Fwaku%2F2%2Frs%2F18%2F1
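
Steps 3 to 5 can be sketched in Python as follows. This is only an illustration of how the request pieces are built, not the actual test code: the helper names are hypothetical, and the JSON-array body for /relay/v1/subscriptions is an assumption about the REST API. Sending the requests is left out so the sketch runs without a node.

```python
import base64
import json
import time
from typing import List, Optional
from urllib.parse import quote

PUBSUB_TOPIC = "/waku/2/rs/18/1"
CONTENT_TOPIC = "/test/1/waku-relay/proto"


def messages_path(pubsub_topic: str) -> str:
    """Path used for both POST (publish) and GET (query) in steps 4 and 5."""
    # safe="" makes quote() encode "/" as %2F, so the topic is one path segment.
    return "/relay/v1/messages/" + quote(pubsub_topic, safe="")


def subscription_body(pubsub_topics: List[str]) -> str:
    """Body for POST /relay/v1/subscriptions (assumed to be a JSON array)."""
    return json.dumps(pubsub_topics)


def relay_message(text: str, content_topic: str,
                  timestamp_ns: Optional[int] = None) -> dict:
    """Message body for step 4: base64 payload and a nanosecond timestamp."""
    return {
        "payload": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        "contentTopic": content_topic,
        "timestamp": time.time_ns() if timestamp_ns is None else timestamp_ns,
    }


if __name__ == "__main__":
    print(subscription_body([PUBSUB_TOPIC]))
    print(messages_path(PUBSUB_TOPIC))
    print(json.dumps(relay_message("Relay works!!", CONTENT_TOPIC)))
```

The payload in step 4 is the base64 encoding of the text "Relay works!!", and the timestamp is in nanoseconds since the Unix epoch.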

Bug

No messages returned

Expected behavior

Message should be returned

nwaku version/commit hash

harbor.status.im/wakuorg/nwaku:latest

Additional context

In the logs I can see that the nodes do not connect, so this is probably the root cause.
I also attached logs from a working run (with wakuorg/go-waku:latest) for comparison:
works.zip
doesn_work.zip

@fbarbu15 fbarbu15 added the bug Something isn't working label Dec 15, 2023
@Ivansete-status
Collaborator

nwaku:latest currently points to the following commit: https://github.com/waku-org/nwaku/tree/0fc617ff69cc91df7582db994c92649720b53830

@gabrielmer gabrielmer self-assigned this Dec 15, 2023
@gabrielmer
Contributor

I see the issue first appearing in dba9820

There seems to be an issue connecting to Relay peers.
In the working scenario, the log messages "Connecting to relay peer" and "Published message to peers" appear; in the newer versions they do not.
In the newer version, it seems that a peer is initially added to the manager but never dialed.

Will keep investigating

@gabrielmer
Contributor

Weekly Update

  • achieved: reproduced the issue both in the testing framework and with local nodes, analyzed logs, and narrowed it down to the commit that introduced the breakage
  • next: continue investigating, find root cause and fix

@gabrielmer
Contributor

We found several issues. First, the main issue causing the test to fail is that ConnectivityLoopInterval was changed from 15 seconds to 1 minute:

ConnectivityLoopInterval = chronos.minutes(1)

A PR has been opened reverting this change.
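
To illustrate why a longer loop interval can break a test: the thread does not say how the test waits for the connection, so the sketch below assumes it waits a bounded time for a condition such as "the peer is connected". The wait_until helper and its parameters are hypothetical and not part of nwaku or the test framework. The clock and sleep functions are injectable so the behaviour can be checked without real delays.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll predicate until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Simulated time: the peer only gets dialed on the next connectivity
    # loop tick, taken here as 60 seconds after start (worst case for 1 minute).
    state = {"now": 0.0}

    def fake_clock() -> float:
        return state["now"]

    def fake_sleep(seconds: float) -> None:
        state["now"] += seconds

    def peer_connected() -> bool:
        return state["now"] >= 60.0

    print(wait_until(peer_connected, 20.0, clock=fake_clock, sleep=fake_sleep))
    state["now"] = 0.0
    print(wait_until(peer_connected, 90.0, clock=fake_clock, sleep=fake_sleep))
```

Under these assumptions, a wait that comfortably covers a 15 second interval times out once the interval grows to 1 minute, unless the timeout is raised above the interval.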

There were also issues with the test itself, which subscribed to pubsub topics from a different cluster than the one the node is configured for. I confirmed this with @fbarbu15 and it has been resolved.
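
A cluster mismatch of this kind can be checked before subscribing. The sketch below assumes static sharding topics of the form /waku/2/rs/<cluster-id>/<shard>, as in /waku/2/rs/18/1 from the reproduction steps. The function names are hypothetical and the check is not taken from the test framework.

```python
import re
from typing import NamedTuple

# Assumed static sharding topic format: /waku/2/rs/<cluster-id>/<shard>
_STATIC_SHARD_RE = re.compile(r"^/waku/2/rs/(\d+)/(\d+)$")


class RelayShard(NamedTuple):
    cluster_id: int
    shard_id: int


def parse_static_shard_topic(topic: str) -> RelayShard:
    """Extract the cluster id and shard from a static sharding pubsub topic."""
    match = _STATIC_SHARD_RE.match(topic)
    if match is None:
        raise ValueError(f"not a static sharding topic: {topic!r}")
    return RelayShard(int(match.group(1)), int(match.group(2)))


def topic_matches_cluster(topic: str, node_cluster_id: int) -> bool:
    """True if the topic belongs to the cluster the node is configured for."""
    return parse_static_shard_topic(topic).cluster_id == node_cluster_id


if __name__ == "__main__":
    topic = "/waku/2/rs/18/1"
    print(parse_static_shard_topic(topic))
    print(topic_matches_cluster(topic, 0))
    print(topic_matches_cluster(topic, 18))
```

For example, the topic from the reproduction steps belongs to cluster 18, so it would not match a node running in cluster 0.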

Apart from this, we also found that a node cannot be started without setting pubsub topics in the CLI if it is in a cluster other than 0:

# Check the ENR sharding info for matching config cluster id
if conf.clusterId != 0:
  let res = record.toTyped()
  if res.isErr():
    error "ENR setup failed", error = $res.get()
    quit(QuitFailure)
  let relayShard = res.get().relaySharding().valueOr:
    error "no sharding info"
    quit(QuitFailure)
  if conf.clusterId != relayShard.clusterId:
    error "cluster id mismatch"
    quit(QuitFailure)

Will also take care of that separately

@gabrielmer
Contributor

Weekly Update

  • achieved: reproduced, investigated, found root causes and fixed


3 participants