Allow specify number of possible unavailable nodes when rerouting #6399
Comments
Hello!
Why would this hurt you here? A new vmstorage node should take only a few minutes to become ready during a rolling upgrade, and a remote-write client such as vmagent or Prometheus should be able to buffer unsuccessful write requests and resend them once vmstorage is back.
You mean having an option like
So in both cases, read and write requests fail, and the vmstorage nodes must be fixed before they can serve again.
If the update goes through without problems, then yes, the buffer on the agents will save us. But this requires a fairly large buffer on the agents, which is difficult for us to provide. There may also be situations when a server goes out for maintenance for a longer time, for example several hours. In this case we would like not to lose data, since the remaining servers are able to handle the load.
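To make the buffer concern concrete, here is a rough back-of-the-envelope sketch of how the required agent buffer grows with outage length. The ingestion rate and bytes-per-sample figures are hypothetical placeholders, not measurements from this cluster.

```python
def required_buffer_bytes(samples_per_second: float,
                          bytes_per_sample: float,
                          outage_seconds: float) -> float:
    """Bytes an agent must buffer to ride out an outage without losing data."""
    return samples_per_second * bytes_per_sample * outage_seconds


# Hypothetical workload: 200k samples/s, about 1 byte per buffered sample.
RATE = 200_000
SAMPLE_SIZE = 1.0

rolling_restart = required_buffer_bytes(RATE, SAMPLE_SIZE, 5 * 60)
long_maintenance = required_buffer_bytes(RATE, SAMPLE_SIZE, 3 * 60 * 60)

print(f"5 min outage: {rolling_restart / 1e6:.0f} MB")
print(f"3 h outage:   {long_maintenance / 1e9:.2f} GB")
```

The buffer scales linearly with outage duration, so a multi-hour maintenance window needs 36 times the space of a five-minute restart under these assumptions.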
In the second case the data will be marked as partial, but that is better than no response at all due to congestion. In this case we can retry the request against another AZ, or merge in data from another AZ (depending on the selected vmselect operation scheme).
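The retry idea described above could look roughly like the following client-side sketch. It assumes each AZ is queried through a caller-supplied function and that a response carries a boolean flag such as isPartial; the function names and the response shape are illustrative assumptions, not a documented client API.

```python
from typing import Callable, Iterable, Optional

# A fetcher returns the decoded response for one AZ, or raises on failure.
Fetcher = Callable[[], dict]


def query_with_fallback(fetchers: Iterable[Fetcher]) -> Optional[dict]:
    """Return the first complete response; use a partial one only as a last resort."""
    first_partial = None
    for fetch in fetchers:
        try:
            response = fetch()
        except Exception:
            continue  # this AZ is unreachable, try the next one
        if not response.get("isPartial", False):
            return response  # complete result, stop here
        if first_partial is None:
            first_partial = response  # remember it in case nothing better appears
    return first_partial
```

Whether a partial result should ever be returned is exactly the point under debate in this thread; dropping the first_partial fallback turns the sketch into a strict "complete or nothing" policy.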
I don't think a wrong result is better than no response, and an anomaly is noticed more quickly when there is no response.
If there is another AZ, you should switch to it directly in this case. It does not matter whether the vmstorage nodes in the first AZ are partially down (partial response) or totally down (no response); otherwise, users and rule evaluation get wrong results. And it's unclear how to set
Yes, we can switch to another AZ, but we would like this to happen automatically. We currently use a single vmselect cluster over multiple AZs, as each AZ may be unavailable for some time (longer than the buffer can hold). In such a scheme, we will wait a very long time for a response from the problem AZ because its servers are overloaded.
It seems that this can be determined empirically by the system administrators who maintain the cluster.
I would recommend using a separate vmselect for each AZ, with vmauth as a proxy in front of the vmselect instances. The topology is like this.
I don't think it's easy to do, and it's hard to provide an actionable recommendation for users.
We can't.
vmselect with
This requires constant manual manipulation of data and configs, and I would like the system to respond to this automatically.
Is your feature request related to a problem? Please describe
Hello!
We have 5 vmstorage nodes in one AZ. When an incident in this AZ makes some of the servers unavailable, too few servers may remain to handle the entire load. For example, if 4 of 5 vmstorage nodes become unavailable, all traffic will be rerouted to 1 server. That server will not withstand such a heavy load, and both writes and reads will slow down.
Describe the solution you'd like
Allow specifying the maximum number of unavailable servers up to which rerouting will still take place.
Thanks)
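A minimal sketch of the requested behaviour, to illustrate the idea only: rerouting is allowed while the number of unavailable nodes stays at or below a limit, and writes fail fast beyond it. The names max_unavailable, reroute_targets and load_factor are hypothetical and do not correspond to any existing flag or function.

```python
from typing import Dict, List


class TooManyNodesUnavailable(Exception):
    """Raised when rerouting would concentrate load on too few nodes."""


def reroute_targets(health: Dict[str, bool], max_unavailable: int) -> List[str]:
    """Return healthy nodes to write to, or fail if too many nodes are down."""
    healthy = sorted(name for name, ok in health.items() if ok)
    down = len(health) - len(healthy)
    if down > max_unavailable:
        raise TooManyNodesUnavailable(
            f"{down} of {len(health)} nodes are down, limit is {max_unavailable}"
        )
    return healthy


def load_factor(total_nodes: int, healthy_nodes: int) -> float:
    """How much more traffic each surviving node receives after rerouting."""
    return total_nodes / healthy_nodes


# The scenario from this issue: 4 of 5 vmstorage nodes are unavailable.
health = {f"vmstorage-{i}": (i == 1) for i in range(1, 6)}
try:
    reroute_targets(health, max_unavailable=2)
except TooManyNodesUnavailable as err:
    print("rerouting refused:", err)
print("load on the last node without a limit:", load_factor(5, 1))
```

With a limit of 2, the single surviving node is never asked to absorb five times its normal traffic; the sender sees an error instead and can buffer or switch to another AZ.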
Describe alternatives you've considered
Now we have only two options:
Additional information
No response