Allow specifying the number of possible unavailable nodes when rerouting #6399

Open
Sinketsu opened this issue Jun 3, 2024 · 8 comments
Labels: enhancement (New feature or request), question (The question issue)


Sinketsu commented Jun 3, 2024

Is your feature request related to a problem? Please describe

Hello!
We have 5 vmstorage nodes in one AZ. When an incident in this AZ makes part of the servers unavailable, the remaining servers may not be enough to absorb the entire load.
For example, if 4 of 5 vmstorage nodes become unavailable, all traffic is rerouted to the single remaining server, which then receives roughly 5x its normal load. It will not withstand such a heavy load, and both writing and reading will slow down.

Describe the solution you'd like

Allow specifying the maximum number of unavailable servers for which rerouting still takes place; once more servers than that are down, rerouting should stop.

Thanks)

Describe alternatives you've considered

Right now we have only two options (a sketch of the requested flag follows this list):

  • Disable rerouting on unavailability, but in this case we suffer when servers are updated sequentially (during rolling upgrades)
  • Enable rerouting on unavailability, but in this case write and read performance suffers
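
A rough sketch of what this could look like on vminsert; the threshold flag is hypothetical, just to illustrate the idea, while `-storageNode` is the existing vminsert flag and the host names are placeholders:

```sh
# Hypothetical: keep rerouting while at most 2 of the 5 vmstorage nodes
# are unavailable; once a 3rd node goes down, stop rerouting instead of
# funneling the whole write stream into the few remaining servers.
/path/to/vminsert \
  -storageNode=vmstorage-1:8400,vmstorage-2:8400,vmstorage-3:8400,vmstorage-4:8400,vmstorage-5:8400 \
  -maxUnavailableStorageNodes=2   # hypothetical flag, not implemented
```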

Additional information

No response

Sinketsu added the enhancement label on Jun 3, 2024

Haleygo commented Jun 5, 2024

Hello!

> Disable rerouting on unavailability, but in this case we suffer when servers are updated sequentially (during rolling upgrades)

Why would you suffer here? A new vmstorage node should take only a few minutes to become ready during a rolling upgrade, and a remote-write client like vmagent or Prometheus should be able to buffer unsuccessful write requests and resend them when vmstorage is back.
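
For reference, a sketch of how that buffering is bounded on vmagent; these are real vmagent flags, but the values, path and URL below are placeholders:

```sh
# vmagent spools unsent data to disk and replays it once the remote
# storage is reachable again; the disk buffer is capped per remote URL.
/path/to/vmagent \
  -remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write \
  -remoteWrite.tmpDataPath=/var/lib/vmagent-buffer \
  -remoteWrite.maxDiskUsagePerURL=10GB
```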

> For example, if 4 of 5 vmstorage nodes become unavailable, all traffic is rerouted to the single remaining server. It will not withstand such a heavy load, and both writing and reading will slow down.

You mean having an option like rerouteMaxUnavailableNodeTolerance=3, so that re-routing is disabled when 4 out of 5 vmstorage nodes are down? I don't see how this option helps with cluster availability, though:

  1. if re-routing is enabled, the only available vmstorage node likely can't handle the load and will crash;
  2. if re-routing is disabled, the query results are very likely partial (a single node contains only 20% of the series with -replicationFactor=1) and unreliable.

So in both cases read and write requests fail, and the vmstorage nodes must be fixed before the cluster can serve traffic.

Haleygo added the question label on Jun 5, 2024
Haleygo self-assigned this on Jun 5, 2024

Sinketsu commented Jun 5, 2024

Hello!

> > Disable rerouting on unavailability, but in this case we suffer when servers are updated sequentially (during rolling upgrades)
>
> Why would you suffer here? A new vmstorage node should take only a few minutes to become ready during a rolling upgrade, and a remote-write client like vmagent or Prometheus should be able to buffer unsuccessful write requests and resend them when vmstorage is back.

If the upgrade goes through without problems, then yes, the buffer on the agents saves us. But this requires a fairly large buffer on the agents, which is difficult for us to provide. There may also be situations when a server goes down for maintenance for longer, for example several hours. In that case we would like not to lose data, since the remaining servers can absorb the load.

> > For example, if 4 of 5 vmstorage nodes become unavailable, all traffic is rerouted to the single remaining server. It will not withstand such a heavy load, and both writing and reading will slow down.
>
> You mean having an option like rerouteMaxUnavailableNodeTolerance=3, so that re-routing is disabled when 4 out of 5 vmstorage nodes are down? I don't see how this option helps with cluster availability, though:
>
> 1. if re-routing is enabled, the only available vmstorage node likely can't handle the load and will crash;
> 2. if re-routing is disabled, the query results are very likely partial (a single node contains only 20% of the series with -replicationFactor=1) and unreliable.
>
> So in both cases read and write requests fail, and the vmstorage nodes must be fixed before the cluster can serve traffic.

In the second case the data will be marked as partial, but that is better than no response at all due to congestion. In this case we can retry the request against another AZ, or merge data from another AZ (depending on the chosen vmselect operation scheme).
More importantly, the servers themselves will not suffer from the large write flow. Right now, a server in such a situation may become unavailable due to high CPU utilization.


Haleygo commented Jun 5, 2024

> ...but that is better than no response at all due to congestion.

I don't think a wrong result is better than no response, and an anomaly is noticed more quickly when there is no response.

> In this case we can retry the request against another AZ, or merge data from another AZ (depending on the chosen vmselect operation scheme).

If there is another AZ, then in this case you should switch to the second AZ directly, no matter whether the vmstorage nodes in the first AZ are partially down (partial response) or totally down (no response); otherwise you get wrong results for users and for rule evaluation.

And it's unclear how to set rerouteMaxUnavailableNodeTolerance for a big cluster: how do you estimate that N nodes down is OK, but N+1 nodes down is unacceptable?


Sinketsu commented Jun 5, 2024

> If there is another AZ, then in this case you should switch to the second AZ directly, no matter whether the vmstorage nodes in the first AZ are partially down (partial response) or totally down (no response); otherwise you get wrong results for users and for rule evaluation.

Yes, we can switch to another AZ, but we would like this to happen automatically. We are currently using a single vmselect cluster over multiple AZs, since each AZ may be unavailable for some time (longer than the buffer can hold). And in such a scheme we will wait a very long time for a response from the problematic AZ due to server overload.
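
For clarity, a sketch of our current scheme under these assumptions (placeholder host names; 8401 is the standard vmstorage port for vmselect connections):

```sh
# a single shared vmselect layer reads vmstorage nodes from both AZs;
# when AZ1 is overloaded rather than fully down, every query waits on it
/path/to/vmselect \
  -storageNode=az1-vmstorage-1:8401,az1-vmstorage-2:8401,az2-vmstorage-1:8401,az2-vmstorage-2:8401
```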

> And it's unclear how to set rerouteMaxUnavailableNodeTolerance for a big cluster: how do you estimate that N nodes down is OK, but N+1 nodes down is unacceptable?

It seems this can be determined empirically by the system administrators who maintain the cluster.


Haleygo commented Jun 5, 2024

> Yes, we can switch to another AZ, but we would like this to happen automatically. We are currently using a single vmselect cluster over multiple AZs, since each AZ may be unavailable for some time (longer than the buffer can hold).

I would recommend using a separate vmselect layer for each AZ, with vmauth as a proxy in front of the vmselects. The topology looks like this:

[topology diagram: vmauth in front of a dedicated vmselect layer per AZ]

vmselect should be configured with -search.denyPartialResponse=true; vmauth uses the first_available policy and will auto-switch to the second AZ when queries against AZ1 start failing. Some pros of this topology (a minimal vmauth config sketch follows the list):

  1. less pressure on vmselect, since each instance reads only its own vmcluster, 50% of the data compared with connecting to both vmclusters;
  2. less cross-AZ network traffic, since you can always set the "local" vmcluster as the first available server.
     See similar usage in https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-distributed.
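
A minimal vmauth config sketch for this topology; the addresses are placeholders, while first_available and retry_status_codes are existing vmauth options:

```yaml
# vmauth sends queries to AZ1's vmselect first and fails over to AZ2
# when AZ1 returns errors, e.g. queries failed under
# -search.denyPartialResponse=true because vmstorage nodes are down.
unauthorized_user:
  url_prefix:
    - http://vmselect-az1:8481/select/0/prometheus/
    - http://vmselect-az2:8481/select/0/prometheus/
  load_balancing_policy: first_available
  retry_status_codes: [500, 502, 503]
```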

> It seems this can be determined empirically by the system administrators who maintain the cluster.

I don't think that's easy to do, and it's hard to provide an actionable recommendation for users.


Sinketsu commented Jun 5, 2024

> I would recommend using a separate vmselect layer for each AZ, with vmauth as a proxy in front of the vmselects.

We can't.
We may have one AZ unavailable for a long time, and during this period there will be no metrics at all in this AZ. vmauth will not be able to detect such a problem, so it will keep sending requests to this zone, which will lead to incorrect dashboards (data gaps) and incorrect alerts.


Haleygo commented Jun 6, 2024

> We may have one AZ unavailable for a long time, and during this period there will be no metrics at all in this AZ.

vmselect with -search.denyPartialResponse=true will fail query requests when more than replicationFactor-1 vmstorage nodes are unavailable, so vmauth will mark this AZ as broken and use the other AZ.
If the storage nodes in AZ1 are all fixed but the old data hasn't been backfilled yet, remove the AZ1 vmselect address from the vmauth config until the data is fixed; this is pretty handy since the vmauth config can be hot-reloaded.
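
For example, a sketch of that manual step (assuming vmauth was started with -auth.config=/etc/vmauth/config.yml; vmauth re-reads its config on SIGHUP):

```sh
# remove the AZ1 vmselect entry from url_prefix while AZ1 is backfilling
$EDITOR /etc/vmauth/config.yml
# tell the running vmauth to re-read -auth.config without a restart
kill -HUP "$(pidof vmauth)"
```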


Sinketsu commented Jun 6, 2024

> > We may have one AZ unavailable for a long time, and during this period there will be no metrics at all in this AZ.
>
> If the storage nodes in AZ1 are all fixed but the old data hasn't been backfilled yet, remove the AZ1 vmselect address from the vmauth config until the data is fixed; this is pretty handy since the vmauth config can be hot-reloaded.

This requires constant manual manipulation of data and configs, and I would like the system to respond to this automatically.
