Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: filter/v2/subscriptions take a lot of time and even timeout sometimes #972

Closed
fbarbu15 opened this issue Dec 22, 2023 · 15 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@fbarbu15
Copy link

To Reproduce

  1. Send a filter/v2/subscriptions with a payload like {"requestId": "c712e7f5-a536-4228-8ecc-de395d8b429f", "contentFilters": ["/test/1/waku-filter/proto"], "pubsubTopic": "/waku/2/rs/0/1"}

Actual behavior
Usually this takes ~10 seconds but many times it happens to go beyond 20 and eventually time out (because of the test timeout) and fail the test

go-waku version/commit hash

harbor.status.im/wakuorg/go-waku:latest

Additional context
Added some logs when this request took a long time

timeout_after_20_seconds.log
took_22_seconds.log

@fbarbu15 fbarbu15 added the bug Something isn't working label Dec 22, 2023
@chaitanyaprem
Copy link
Collaborator

chaitanyaprem commented Dec 26, 2023

Going through the logs i have following observations:

  • The time taken for filter-subscribe is because a filter peer is not connected which matches specific shard and hence triggering peer discovery (which may take some time). This could be the reason for seeing a delay.
  • In the timeout_after_20seconds log file, I could not see any logs related to filter subscribe, so not sure if the request was initiated

I find it odd that discovery takes close to 20 seconds in a test setup. Because i see peers immediately getting identified and getting added.
But i also noticed that the discovered peers doesn't have relay shards as indicated in log below.

210 2023-12-22T12:21:32.825Z ^[[35mDEBUG^[[0m gowaku.node2.peer-manager peermanager/peer_manager.go:323 ENR doesn't have relay shards {"node": "16Uiu2HAmAsB1NRYA6wVeBQCbjbTncsaHnFJEkw46c5GL19X7VeVz", "peer": "16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip"}

Few ways this can be mitigated in test env is :

  • Start the filter service-nodes by subscribing them to static shards (e.g: /waku/2/rs/0/1) rather than default pubsubTopic. This would ensure node's ENR contain shard info.
  • Don't depend on discovery for simple tests, rather add the peer either via REST API (do note go-waku AddPeer REST API differs from nwaku) or statically indicate it as filter-servicenode.

Would be good if you can share what is your test setup for this test.

@chaitanyaprem chaitanyaprem self-assigned this Dec 26, 2023
@fbarbu15
Copy link
Author

I do use the filternode flag when starting the filter node.
Regarding the REST API add peer endpoint, are there similar specs for gowaku so I can see how it works?

I've dug a little deeper and it would seem that this issue happens for the 1st requests after starting the nodes (notice in the test logs bellow that the http://127.0.0.1:10775/filter/v2/subscriptions request times out after 60 seconds)
docker_logs.zip
Also you can see the node starting flags and tests setup in the test logs:

10:29:35 src.node.docker_mananger DEBUG Starting container with image harbor.status.im/wakuorg/go-waku:latest
10:29:35 src.node.docker_mananger DEBUG Using args ['--listen-address=0.0.0.0', '--rpc=true', '--rpc-admin=true', '--rest=true', '--rest-admin=true', '--websocket-support=true', '--log-level=DEBUG', '--rest-relay-cache-capacity=100', '--websocket-port=37805', '--rpc-port=37803', '--rest-port=37802', '--tcp-port=37804', '--discv5-udp-port=37806', '--rpc-address=0.0.0.0', '--rest-address=0.0.0.0', '--nat=extip:172.18.1.197', '--peer-exchange=true', '--discv5-discovery=true', '--min-relay-peers-to-publish=0', '--legacy-filter=false', '--relay=true', '--filter=true', '--nodekey=30348dd51465150e04a5d9d932c72864c8967f806cce60b5d26afeca1e77eb68', '--cluster-id=0']
10:29:36 src.node.docker_mananger DEBUG Container started with ID a7ce04d67001. Setting up logs at ./log/docker/node1_2023-12-27_10-29-35__12885372-86c3-45c4-b0c2-f85e97889e3a__harbor.status.im_wakuorg_go-waku:latest.log
10:29:36 src.node.waku_node DEBUG Started container from image harbor.status.im/wakuorg/go-waku:latest. RPC: 37803 REST: 37802 WebSocket: 37805 TCP: 37804
10:29:36 src.libs.common DEBUG Sleeping for 1 seconds
10:29:37 src.node.api_clients.base_client INFO GET call: http://127.0.0.1:37802/debug/v1/info with payload: None
10:29:37 src.node.api_clients.base_client INFO Response status code: 200. Response content: b'{"enrUri":"enr:-Ly4QOcl6Q09dTJ4jdWMK5pUngqtVBYkYk7jyGHBgnXTMtkGINZdK8p_6npqiPJ5VuHXUoUCdWN81OHRVR_dJ3JqDCuGAYyqZLbxgmlkgnY0gmlwhKwRAAKKbXVsdGlhZGRyc5gACgSsEQACBpOt3QMACgR_AAABBpOt3QOJc2VjcDI1NmsxoQM3Tqpf5eFn4Jztm4gB0Y0JVSJyxyZsW8QR-QU5DZb-PYN0Y3CCk6yDdWRwgpOuhXdha3UyAQ","listenAddresses":["/ip4/127.0.0.1/tcp/37804/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip","/ip4/127.0.0.1/tcp/37805/ws/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip","/ip4/172.17.0.2/tcp/37804/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip","/ip4/172.17.0.2/tcp/37805/ws/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip"]}'
10:29:37 src.node.waku_node INFO REST service is ready !!
10:29:37 src.steps.filter DEBUG Running fixture setup: setup_main_filter_node
10:29:37 src.node.docker_mananger DEBUG Docker client initialized with image harbor.status.im/wakuorg/go-waku:latest
10:29:37 src.node.waku_node DEBUG WakuNode instance initialized with log path ./log/docker/node2_2023-12-27_10-29-35__12885372-86c3-45c4-b0c2-f85e97889e3a__harbor.status.im_wakuorg_go-waku:latest.log
10:29:37 src.node.waku_node DEBUG Starting Node...
10:29:37 src.node.docker_mananger DEBUG Attempting to create or retrieve network waku
10:29:37 src.node.docker_mananger DEBUG Network waku already exists
10:29:37 src.node.docker_mananger DEBUG Generated random external IP 172.18.218.251
10:29:37 src.node.docker_mananger DEBUG Generated ports [10775, 10776, 10777, 10778, 10779, 10780]
10:29:37 src.node.docker_mananger DEBUG Starting container with image harbor.status.im/wakuorg/go-waku:latest
10:29:37 src.node.docker_mananger DEBUG Using args ['--listen-address=0.0.0.0', '--rpc=true', '--rpc-admin=true', '--rest=true', '--rest-admin=true', '--websocket-support=true', '--log-level=DEBUG', '--rest-relay-cache-capacity=100', '--websocket-port=10778', '--rpc-port=10776', '--rest-port=10775', '--tcp-port=10777', '--discv5-udp-port=10779', '--rpc-address=0.0.0.0', '--rest-address=0.0.0.0', '--nat=extip:172.18.218.251', '--peer-exchange=true', '--discv5-discovery=true', '--min-relay-peers-to-publish=1', '--legacy-filter=false', '--relay=false', '--filter=true', '--discv5-bootstrap-node=enr:-Ly4QOcl6Q09dTJ4jdWMK5pUngqtVBYkYk7jyGHBgnXTMtkGINZdK8p_6npqiPJ5VuHXUoUCdWN81OHRVR_dJ3JqDCuGAYyqZLbxgmlkgnY0gmlwhKwRAAKKbXVsdGlhZGRyc5gACgSsEQACBpOt3QMACgR_AAABBpOt3QOJc2VjcDI1NmsxoQM3Tqpf5eFn4Jztm4gB0Y0JVSJyxyZsW8QR-QU5DZb-PYN0Y3CCk6yDdWRwgpOuhXdha3UyAQ', '--filternode=/ip4/172.18.1.197/tcp/37804/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', '--cluster-id=0']
10:29:38 src.node.docker_mananger DEBUG Container started with ID 2ad1d3e78997. Setting up logs at ./log/docker/node2_2023-12-27_10-29-35__12885372-86c3-45c4-b0c2-f85e97889e3a__harbor.status.im_wakuorg_go-waku:latest.log
10:29:38 src.node.waku_node DEBUG Started container from image harbor.status.im/wakuorg/go-waku:latest. RPC: 10776 REST: 10775 WebSocket: 10778 TCP: 10777
10:29:38 src.libs.common DEBUG Sleeping for 1 seconds
10:29:39 src.node.api_clients.base_client INFO GET call: http://127.0.0.1:10775/debug/v1/info with payload: None
10:29:39 src.node.api_clients.base_client INFO Response status code: 200. Response content: b'{"enrUri":"enr:-Ly4QKM-6FyLNx_1Ajg7DqY-JTzE0dJnAldAUhd0DNDFnSPlU7FXcwK9UGusENw96-KPCw9rVVyOabBX6fVf9Tj_6yWGAYyqZL1-gmlkgnY0gmlwhKwRAAOKbXVsdGlhZGRyc5gACgSsEQADBioa3QMACgR_AAABBioa3QOJc2VjcDI1NmsxoQIdHACSBpmPFFiDDJp488NdYBuXloED178SFSJBRYdn14N0Y3CCKhmDdWRwgiobhXdha3UygA","listenAddresses":["/ip4/127.0.0.1/tcp/10777/p2p/16Uiu2HAkwPJZvrMJPyvMi1hp1soevFLouJ15qyCuXc2kCQuCb7Yz","/ip4/127.0.0.1/tcp/10778/ws/p2p/16Uiu2HAkwPJZvrMJPyvMi1hp1soevFLouJ15qyCuXc2kCQuCb7Yz","/ip4/172.17.0.3/tcp/10777/p2p/16Uiu2HAkwPJZvrMJPyvMi1hp1soevFLouJ15qyCuXc2kCQuCb7Yz","/ip4/172.17.0.3/tcp/10778/ws/p2p/16Uiu2HAkwPJZvrMJPyvMi1hp1soevFLouJ15qyCuXc2kCQuCb7Yz"]}'
10:29:39 src.node.waku_node INFO REST service is ready !!
------------------------------------------------------------------------------------------------ live log call ------------------------------------------------------------------------------------------------
10:29:39 tests.filter.test_subscribe_create DEBUG Running test with pubsub topic: /waku/2/rs/0/1
10:29:39 src.node.api_clients.base_client INFO POST call: http://127.0.0.1:37802/relay/v1/subscriptions with payload: ["/waku/2/rs/0/1"]
10:29:39 src.node.api_clients.base_client INFO Response status code: 200. Response content: b'true'
10:29:39 src.libs.common DEBUG Sleeping for 10 seconds
10:29:49 src.node.api_clients.base_client INFO POST call: http://127.0.0.1:10775/filter/v2/subscriptions with payload: {"requestId": "9a2c0aeb-b4c5-4cf5-b1b5-d826c4ccf5fb", "contentFilters": ["/waku/2/rs/0/1"], "pubsubTopic": "/waku/2/rs/0/1"}
10:30:49 tests.filter.test_subscribe_create ERROR PubsubTopic /waku/2/rs/0/1 failed: HTTPConnectionPool(host='127.0.0.1', port=10775): Read timed out. (read timeout=60)

@chaitanyaprem
Copy link
Collaborator

Regarding the REST API add peer endpoint, are there similar specs for gowaku so I can see how it works?

We don't have a similar page, but the yaml file for this can be found at https://github.com/waku-org/go-waku/blob/master/cmd/waku/server/rest/admin_api.yaml. You can refer to the details expected in request and response. You can import this file into postman to get sample req/resp structures

@chaitanyaprem
Copy link
Collaborator

chaitanyaprem commented Dec 29, 2023

I do use the filternode flag when starting the filter node.

I am not talking about --filter flag rather --filternode flag.
--filter flag enables the node to act as as filter server/service node.
--filternode flag is to tell a filterClient which filterNode to connect to. This node need not act as a filter server again.

From the logs i don't see --filternode flag being specified, which is why i am guessing discovery could be causing the delay. Will go through the logs in detail to analyze what might be happening.

In the meantime it would be helpful to know your test setup for this case.

@fbarbu15
Copy link
Author

the 2nd node is using '--filternode=/ip4/172.18.1.197/tcp/37804/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip' in the logs above

@fbarbu15
Copy link
Author

Regarding the test setup it should be all in those logs.
I'm starting 2 nodes (first relay + filter), 2nd just filter and connect them via filter protocol.
After that the relay node is subscribing to this topic:

10:29:39 src.node.api_clients.base_client INFO POST call: http://127.0.0.1:37802/relay/v1/subscriptions with payload: ["/waku/2/rs/0/1"]
10:29:39 src.node.api_clients.base_client INFO Response status code: 200. Response content: b'true'

Then a 10 sec propagation sleep:
10:29:39 src.libs.common DEBUG Sleeping for 10 seconds
and the filter node tries to subscribe but the request times out after 60 seconds (the timeout I've set to the requests client)

10:29:49 src.node.api_clients.base_client INFO POST call: http://127.0.0.1:10775/filter/v2/subscriptions with payload: {"requestId": "9a2c0aeb-b4c5-4cf5-b1b5-d826c4ccf5fb", "contentFilters": ["/waku/2/rs/0/1"], "pubsubTopic": "/waku/2/rs/0/1"}
10:30:49 tests.filter.test_subscribe_create ERROR PubsubTopic /waku/2/rs/0/1 failed: HTTPConnectionPool(host='127.0.0.1', port=10775): Read timed out. (read timeout=60)

@chaitanyaprem
Copy link
Collaborator

the 2nd node is using '--filternode=/ip4/172.18.1.197/tcp/37804/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip' in the logs above

Ah..somehow i missed it. Thanks

@chaitanyaprem
Copy link
Collaborator

Can you rerun this by subscribing the nodes to pubsubTopic /ws/rs/2/0/1 via command-line? This will ensure the node won't subscribe to default named pubsubTopic.
I am wondering if this is falling into similar category as i have explained here

Just wanted to eliminate 1 more variable to narrow down scope of the potential issue.

@fbarbu15
Copy link
Author

fbarbu15 commented Jan 3, 2024

I tried to test this but there's a regression that blocks go-waku tests: #988

@chaitanyaprem
Copy link
Collaborator

chaitanyaprem commented Jan 8, 2024

Weekly Update

  • achieved:fix panic when static peer is also discovered dynamically, fix ENR Waku field population for Fitler, analyze data races identified

@fbarbu15
Copy link
Author

fbarbu15 commented Jan 8, 2024

I tried starting the Relay node with pubsubTopic /waku/2/rs/0/1 and this fixes the problem, the filter/v2/subscriptions to /waku/2/rs/0/1 is done in milliseconds.
So this flag is indeed responsible for this issue. For ex If instead of /waku/2/rs/0/1 I start the node with /waku/2/rs/0/2 the issue happens again, the filter/v2/subscriptions with /waku/2/rs/0/1 takes a lot

@chaitanyaprem
Copy link
Collaborator

So this flag is indeed responsible for this issue. For ex If instead of /waku/2/rs/0/1 I start the node with /waku/2/rs/0/2 the issue happens again, the filter/v2/subscriptions with /waku/2/rs/0/1 takes a lot

Does it timeout? If so that means the other node is not supporting that shard. Otherwise discovery should happen and it should still work. Can you confirm?

@fbarbu15
Copy link
Author

fbarbu15 commented Jan 8, 2024

Yes, the first filter/v2/subscriptions request will timeout. But after the timeout, if I retry, it will work immediately.
On the other hand, if the shard is not supported (for ex the subscription request is done on /waku/2/rs/1/0 ) all requests will timeout, not just the 1st.
So the problem here is that the 1st request after node start will timeout

@chaitanyaprem
Copy link
Collaborator

Yes, the first filter/v2/subscriptions request will timeout. But after the timeout, if I retry, it will work immediately. On the other hand, if the shard is not supported (for ex the subscription request is done on /waku/2/rs/1/0 ) all requests will timeout, not just the 1st. So the problem here is that the 1st request after node start will timeout

That is interesting, do share logs where first request times-out for both the nodes.

@fbarbu15
Copy link
Author

fbarbu15 commented Jan 9, 2024

nodes.zip
First request is done at 11:12:35 and it times out after 50 seconds (the timeout setup in the test client).
The 2nd one is done at 11:13:26 and it's succesfull

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

No branches or pull requests

3 participants