/nodes response time increases as we increase the number of offline nodes #1435
Comments
Thanks, I'll see if I can reproduce. Any fix will only be present in 8.0, just so you know.
In other words, I don't plan to make any more releases for the 7.x series.
If I run a quick local test -- run 3 nodes, kill 1, and call /nodes -- I get:

$ curl 'localhost:4001/nodes?pretty&ver=2'
{
  "nodes": [
    {
      "id": "1",
      "api_addr": "http://localhost:4001",
      "addr": "localhost:4002",
      "voter": true,
      "reachable": true,
      "leader": true,
      "time": 0.000013489
    },
    {
      "id": "2",
      "api_addr": "http://localhost:4003",
      "addr": "localhost:4004",
      "voter": true,
      "reachable": true,
      "leader": false,
      "time": 0.000117832
    },
    {
      "id": "3",
      "addr": "localhost:4006",
      "voter": true,
      "reachable": false,
      "leader": false,
      "error": "pool get: dial tcp [::1]:4006: connect: connection refused"
    }
  ]
}

It returns instantly. I'm not familiar with the error you're seeing ("....actively refused it."); I wonder if the error-handling code isn't dealing with your network setup correctly. Is there something special about how your nodes are networked together? What can you tell me, @dwco-z?
Actually, I might have repro'ed something similar. If I kill a node, but then do …
OK, I see what's going on. I need a higher-level timeout check to really capture all cases. This will be fixed in 8.0, but it would be easy enough to backport to the 7.x series if you want to build that code yourself. You'd need to modify https://github.com/rqlite/rqlite/blob/v7.21.4/http/service.go#L1846
Just letting you know that I tested the master version and I can see that the timeout is being respected, thanks.
I hope to get 8.0 out soon; I've only one last change to get in, so hopefully 8.0 will be something you can use to take advantage of this improvement.
What version are you running?
v7.14.3
Are you using Docker or Kubernetes to run your system?
No
Are you running a single node or a cluster?
Cluster
What did you do?
I'm testing the /nodes endpoint with "/nodes?nonvoters&timeout=1s" on a Windows machine running multiple rqlite instances. I noticed that the number of offline/unavailable nodes in the cluster drastically increases the /nodes response time.
For example, if I have 8 nodes (all in 127.0.0.1) with 5 offline rqlite instances:
[Voter] - Online
[Voter] - Online
[Voter] - Online
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
[NonVoter] - Offline
It takes ~10 seconds to get a response from the leader node. If I increase the number of offline nodes, the response time increases.
What did you expect to happen?
I was expecting the response time to be smaller, especially because I'm using timeout=1s.
This is the response for the 8 nodes / 5 offline case after the ~10 seconds:
{
  "127.0.0.1:9600": {
    "api_addr": "https://127.0.0.1:9500",
    "addr": "127.0.0.1:9600",
    "reachable": true,
    "leader": true
  },
  "127.0.0.1:9601": {
    "api_addr": "https://127.0.0.1:9501",
    "addr": "127.0.0.1:9601",
    "reachable": true,
    "leader": false,
    "time": 0.1212747
  },
  "127.0.0.1:9602": {
    "api_addr": "https://127.0.0.1:9502",
    "addr": "127.0.0.1:9602",
    "reachable": true,
    "leader": false,
    "time": 0.168375
  },
  "127.0.0.1:9603": {
    "addr": "127.0.0.1:9603",
    "reachable": false,
    "leader": false,
    "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9603: connectex: No connection could be made because the target machine actively refused it."
  },
  "127.0.0.1:9604": {
    "addr": "127.0.0.1:9604",
    "reachable": false,
    "leader": false,
    "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9604: connectex: No connection could be made because the target machine actively refused it."
  },
  "127.0.0.1:9605": {
    "addr": "127.0.0.1:9605",
    "reachable": false,
    "leader": false,
    "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9605: connectex: No connection could be made because the target machine actively refused it."
  },
  "127.0.0.1:9606": {
    "addr": "127.0.0.1:9606",
    "reachable": false,
    "leader": false,
    "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9606: connectex: No connection could be made because the target machine actively refused it."
  },
  "127.0.0.1:9607": {
    "addr": "127.0.0.1:9607",
    "reachable": false,
    "leader": false,
    "error": "factory is not able to fill the pool: dial tcp 127.0.0.1:9607: connectex: No connection could be made because the target machine actively refused it."
  }
}
The first three nodes in the JSON response are the voters kept online; the remaining five are the non-voters that were stopped.