/nodes response time increases as we increase the number of offline nodes #1435

Closed
dwco-z opened this issue Nov 29, 2023 · 7 comments · Fixed by #1437

dwco-z commented Nov 29, 2023

What version are you running?
v7.14.3

Are you using Docker or Kubernetes to run your system?
No

Are you running a single node or a cluster?
Cluster

What did you do?
I'm testing the /nodes endpoint with "/nodes?nonvoters&timeout=1s" on a Windows machine running multiple rqlite instances. I noticed that the number of offline/unavailable nodes in the cluster drastically increases the /nodes response time.
For example, if I have 8 nodes (all on 127.0.0.1) with 5 offline rqlite instances:
[Voter] - Online
[Voter] - Online
[Voter] - Online
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
It takes ~10 seconds to get a response from the leader node. If I increase the number of offline nodes, the response time increases.

What did you expect to happen?
I was expecting the response time to be smaller, especially since I'm using timeout=1s.

This is the response for the 8 nodes / 5 offline case, after ~10 seconds:
{
    "127.0.0.1:9600": {
        "api_addr": "https://127.0.0.1:9500",
        "addr": "127.0.0.1:9600",
        "reachable": true,
        "leader": true
    },
    "127.0.0.1:9601": {
        "api_addr": "https://127.0.0.1:9501",
        "addr": "127.0.0.1:9601",
        "reachable": true,
        "leader": false,
        "time": 0.1212747
    },
    "127.0.0.1:9602": {
        "api_addr": "https://127.0.0.1:9502",
        "addr": "127.0.0.1:9602",
        "reachable": true,
        "leader": false,
        "time": 0.168375
    },
    "127.0.0.1:9603": {
        "addr": "127.0.0.1:9603",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9603: connectex: No connection could be made because the target machine actively refused it."
    },
    "127.0.0.1:9604": {
        "addr": "127.0.0.1:9604",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9604: connectex: No connection could be made because the target machine actively refused it."
    },
    "127.0.0.1:9605": {
        "addr": "127.0.0.1:9605",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9605: connectex: No connection could be made because the target machine actively refused it."
    },
    "127.0.0.1:9606": {
        "addr": "127.0.0.1:9606",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9606: connectex: No connection could be made because the target machine actively refused it."
    },
    "127.0.0.1:9607": {
        "addr": "127.0.0.1:9607",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9607: connectex: No connection could be made because the target machine actively refused it."
    }
}

The first three nodes in the JSON response are the voters kept online; the remaining five are the non-voters that were stopped.
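
For reference, this is roughly how I'm timing the request client-side. It's a minimal Go sketch, not rqlite code; the 127.0.0.1:9500 address and the query string come from my setup above, and the cert-skipping is just because my local test cluster uses self-signed certificates:

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Skip certificate verification: the local test cluster uses self-signed certs.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	start := time.Now()
	resp, err := client.Get("https://127.0.0.1:9500/nodes?nonvoters&timeout=1s")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// With 5 offline non-voters this reports ~10s instead of the ~1s I'd expect.
	fmt.Printf("took %s\n%s\n", time.Since(start), body)
}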

otoolep commented Nov 29, 2023

Thanks, I'll see if I can reproduce. Any fix will only be present in 8.0, just so you know.

otoolep commented Nov 29, 2023

In other words, I don't plan to make any more releases for the 7.x series.

otoolep commented Dec 1, 2023

If I run a quick local test -- run 3 nodes, kill 1, and call /nodes -- I get:

$ curl 'localhost:4001/nodes?pretty&ver=2'
{
    "nodes": [
        {
            "id": "1",
            "api_addr": "http://localhost:4001",
            "addr": "localhost:4002",
            "voter": true,
            "reachable": true,
            "leader": true,
            "time": 0.000013489
        },
        {
            "id": "2",
            "api_addr": "http://localhost:4003",
            "addr": "localhost:4004",
            "voter": true,
            "reachable": true,
            "leader": false,
            "time": 0.000117832
        },
        {
            "id": "3",
            "addr": "localhost:4006",
            "voter": true,
            "reachable": false,
            "leader": false,
            "error": "pool get: dial tcp [::1]:4006: connect: connection refused"
        }
    ]
}

It returns instantly.

I'm not familiar with the error you're seeing ("...actively refused it."). I wonder if the error-handling code isn't dealing with your network setup correctly. Is there something special about how your nodes are networked together? What can you tell me, @dwco-z?

otoolep commented Dec 1, 2023

Actually, I might have repro'ed something similar. If I kill a node but then run nc -l 4006 so there is an open TCP port, the timeout doesn't seem to work: I see a 5-second delay before /nodes returns.
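
A minimal sketch (not rqlite code) of what I think is happening: against an open-but-silent port the dial succeeds immediately, so a dial-level timeout alone never fires, and nothing bounds the subsequent read unless a read deadline or a higher-level timeout is applied. With nc -l 4006 running:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// The dial-level timeout is satisfied immediately, because nc accepts the connection.
	conn, err := net.DialTimeout("tcp", "localhost:4006", time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	// Without this read deadline (or a timeout enforced at a higher level),
	// the read below blocks for as long as nc stays silent.
	conn.SetReadDeadline(time.Now().Add(time.Second))
	if _, err := conn.Read(make([]byte, 1)); err != nil {
		fmt.Println("read failed:", err) // i/o timeout after ~1s
	}
}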

otoolep linked a pull request Dec 1, 2023 that will close this issue

otoolep commented Dec 1, 2023

OK, I see what's going on. I need a higher-level timeout check to really capture all cases. This will be fixed in 8.0, but it would be easy enough to backport to the 7.x series if you want to build that code yourself. You'd need to modify https://github.com/rqlite/rqlite/blob/v7.21.4/http/service.go#L1846
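
Roughly the shape of the change, as a sketch only (checkNode below is a hypothetical stand-in for the real per-node check in service.go, not the actual rqlite code): enforce the timeout one level up, so it applies no matter where the lower-level call stalls.

package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// checkNode is a hypothetical placeholder for the real per-node reachability
// check: dial and wait for a response byte. Against an open-but-silent port
// (e.g. `nc -l 4006`) it blocks on the read with no deadline of its own.
func checkNode(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Read(make([]byte, 1))
	return err
}

// checkNodeWithTimeout enforces the timeout at a higher level, so a stalled
// lower-level check cannot hold up the whole /nodes response. (The stalled
// goroutine is simply abandoned in this sketch.)
func checkNodeWithTimeout(addr string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		done <- checkNode(addr, timeout)
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errors.New("timeout waiting for node " + addr)
	}
}

func main() {
	// With `nc -l 4006` running, this prints the timeout error after ~1s
	// instead of hanging on the blocked read.
	fmt.Println(checkNodeWithTimeout("localhost:4006", time.Second))
}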

dwco-z commented Dec 1, 2023

Just letting you know that I tested the master version and I can see that the timeout is being respected. Thanks!

otoolep commented Dec 1, 2023

I hope to get 8.0 out soon; I've only one last change to get in. Hopefully 8.0 will be something you can use, so you can take advantage of this improvement.
