/nodes response time increases as we increase the number of offline nodes #1435

Closed
dwco-z opened this issue Nov 29, 2023 · 7 comments · Fixed by #1437

dwco-z commented Nov 29, 2023

What version are you running?
v7.14.3

Are you using Docker or Kubernetes to run your system?
No

Are you running a single node or a cluster?
Cluster

What did you do?
I'm testing the /nodes endpoint with "/nodes?nonvoters&timeout=1s" on a Windows machine running multiple rqlite instances. I noticed that the number of offline/unavailable nodes in the cluster drastically increases the /nodes response time.
For example, if I have 8 nodes (all on 127.0.0.1) with 5 offline rqlite instances:
[Voter] - Online
[Voter] - Online
[Voter] - Online
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
It takes ~10 seconds to get a response from the leader node. If I increase the number of offline nodes, the response time increases.

What did you expect to happen?
I was expecting the response time to be smaller, especially since I'm using timeout=1s.

This is the response for the 8 nodes / 5 offline case, after ~10 seconds:
{
    "127.0.0.1:9600": {
        "api_addr": "https://127.0.0.1:9500",
        "addr": "127.0.0.1:9600",
        "reachable": true,
        "leader": true
    },
    "127.0.0.1:9601": {
        "api_addr": "https://127.0.0.1:9501",
        "addr": "127.0.0.1:9601",
        "reachable": true,
        "leader": false,
        "time": 0.1212747
    },
    "127.0.0.1:9602": {
        "api_addr": "https://127.0.0.1:9502",
        "addr": "127.0.0.1:9602",
        "reachable": true,
        "leader": false,
        "time": 0.168375
    },
    "127.0.0.1:9603": {
        "addr": "127.0.0.1:9603",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9603: connectex: No connection could be made because the target machine actively refused it."
    },
    "127.0.0.1:9604": {
        "addr": "127.0.0.1:9604",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9604: connectex: No connection could be made because the target machine actively refused it."
    },
    "127.0.0.1:9605": {
        "addr": "127.0.0.1:9605",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9605: connectex: No connection could be made because the target machine actively refused it."
    },
    "127.0.0.1:9606": {
        "addr": "127.0.0.1:9606",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9606: connectex: No connection could be made because the target machine actively refused it."
    },
    "127.0.0.1:9607": {
        "addr": "127.0.0.1:9607",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9607: connectex: No connection could be made because the target machine actively refused it."
    }
}

The first three nodes in the JSON response are the voters kept online; the remaining five are the non-voters that were stopped.
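
For reference, this is roughly how I'm timing the request client-side. It's a minimal Go sketch, not rqlite code; the 127.0.0.1:9500 address and the query string come from my setup above, and the cert-skipping is just because my local test cluster uses self-signed certificates:

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Skip certificate verification: the local test cluster uses self-signed certs.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	start := time.Now()
	resp, err := client.Get("https://127.0.0.1:9500/nodes?nonvoters&timeout=1s")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// With 5 offline non-voters this reports ~10s instead of the ~1s I'd expect.
	fmt.Printf("took %s\n%s\n", time.Since(start), body)
}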

otoolep commented Nov 29, 2023

Thanks, I'll see if I can reproduce. Any fix will only be present in 8.0, just so you know.

otoolep commented Nov 29, 2023

In other words, I don't plan to make any more releases for the 7.x series.

otoolep commented Dec 1, 2023

If I run a quick local test -- run 3 nodes, kill 1, and call /nodes -- I get:

$ curl 'localhost:4001/nodes?pretty&ver=2'
{
    "nodes": [
        {
            "id": "1",
            "api_addr": "http://localhost:4001",
            "addr": "localhost:4002",
            "voter": true,
            "reachable": true,
            "leader": true,
            "time": 0.000013489
        },
        {
            "id": "2",
            "api_addr": "http://localhost:4003",
            "addr": "localhost:4004",
            "voter": true,
            "reachable": true,
            "leader": false,
            "time": 0.000117832
        },
        {
            "id": "3",
            "addr": "localhost:4006",
            "voter": true,
            "reachable": false,
            "leader": false,
            "error": "pool get: dial tcp [::1]:4006: connect: connection refused"
        }
    ]
}

It returns instantly.

I'm not familiar with the error you're seeing ("...actively refused it."). I wonder if the error-handling code isn't dealing with your network setup correctly. Is there something special about how your nodes are networked together? What can you tell me, @dwco-z?

otoolep commented Dec 1, 2023

Actually, I might have repro'ed something similar. If I kill a node but then run nc -l 4006 so there is an open TCP port, the timeout doesn't seem to work: I see a 5-second delay before /nodes returns.
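
A minimal sketch (not rqlite code) of what I think is happening: against an open-but-silent port the dial succeeds immediately, so a dial-level timeout alone never fires, and nothing bounds the subsequent read unless a read deadline or a higher-level timeout is applied. With nc -l 4006 running:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// The dial-level timeout is satisfied immediately, because nc accepts the connection.
	conn, err := net.DialTimeout("tcp", "localhost:4006", time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	// Without this read deadline (or a timeout enforced at a higher level),
	// the read below blocks for as long as nc stays silent.
	conn.SetReadDeadline(time.Now().Add(time.Second))
	if _, err := conn.Read(make([]byte, 1)); err != nil {
		fmt.Println("read failed:", err) // i/o timeout after ~1s
	}
}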

otoolep linked a pull request Dec 1, 2023 that will close this issue

otoolep commented Dec 1, 2023

OK, I see what's going on. I need a higher-level timeout check to really capture all cases. This will be fixed in 8.0, but it would be easy enough to backport to the 7.x series if you want to build that code yourself. You'd need to modify https://github.com/rqlite/rqlite/blob/v7.21.4/http/service.go#L1846
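
Roughly the shape of the change, as a sketch only (checkNode below is a hypothetical stand-in for the real per-node check in service.go, not the actual rqlite code): enforce the timeout one level up, so it applies no matter where the lower-level call stalls.

package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// checkNode is a hypothetical placeholder for the real per-node reachability
// check: dial and wait for a response byte. Against an open-but-silent port
// (e.g. `nc -l 4006`) it blocks on the read with no deadline of its own.
func checkNode(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Read(make([]byte, 1))
	return err
}

// checkNodeWithTimeout enforces the timeout at a higher level, so a stalled
// lower-level check cannot hold up the whole /nodes response. (The stalled
// goroutine is simply abandoned in this sketch.)
func checkNodeWithTimeout(addr string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		done <- checkNode(addr, timeout)
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errors.New("timeout waiting for node " + addr)
	}
}

func main() {
	// With `nc -l 4006` running, this prints the timeout error after ~1s
	// instead of hanging on the blocked read.
	fmt.Println(checkNodeWithTimeout("localhost:4006", time.Second))
}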

dwco-z commented Dec 1, 2023

Just letting you know that I tested the master version and I can see that the timeout is being respected. Thanks!

otoolep commented Dec 1, 2023

I hope to get 8.0 out soon; I've only one last change to get in. Hopefully 8.0 will be something you can use, so you can take advantage of this improvement.
