Allow specifying the number of possible unavailable nodes when rerouting #6399

Open
Sinketsu opened this issue Jun 3, 2024 · 8 comments
Labels: enhancement (New feature or request), question (The question issue)


Sinketsu commented Jun 3, 2024

Is your feature request related to a problem? Please describe

Hello!
We have 5 vmstorage nodes in one AZ. When an incident in this AZ makes part of the servers unavailable, the remaining servers may not be enough to absorb the entire load.
For example, if 4 of 5 vmstorage nodes become unavailable, all traffic is rerouted to the single remaining server, which then receives roughly 5x its normal load. It will not withstand such a heavy load, and both writing and reading will slow down.

Describe the solution you'd like

Allow specifying the maximum number of unavailable servers for which rerouting still takes place; once more servers than that are down, rerouting should stop.

Thanks)

Describe alternatives you've considered

Right now we have only two options (a sketch of the requested flag follows this list):

  • Disable rerouting on unavailability, but in this case we suffer when servers are updated sequentially (during rolling upgrades)
  • Enable rerouting on unavailability, but in this case write and read performance suffers
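
A rough sketch of what this could look like on vminsert; the threshold flag is hypothetical, just to illustrate the idea, while `-storageNode` is the existing vminsert flag and the host names are placeholders:

```sh
# Hypothetical: keep rerouting while at most 2 of the 5 vmstorage nodes
# are unavailable; once a 3rd node goes down, stop rerouting instead of
# funneling the whole write stream into the few remaining servers.
/path/to/vminsert \
  -storageNode=vmstorage-1:8400,vmstorage-2:8400,vmstorage-3:8400,vmstorage-4:8400,vmstorage-5:8400 \
  -maxUnavailableStorageNodes=2   # hypothetical flag, not implemented
```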

Additional information

No response

Sinketsu added the enhancement label on Jun 3, 2024

Haleygo commented Jun 5, 2024

Hello!

> Disable rerouting on unavailability, but in this case we suffer when servers are updated sequentially (during rolling upgrades)

Why would you suffer here? A new vmstorage node should take only a few minutes to become ready during a rolling upgrade, and a remote-write client like vmagent or Prometheus should be able to buffer unsuccessful write requests and resend them when vmstorage is back.
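
For reference, a sketch of how that buffering is bounded on vmagent; these are real vmagent flags, but the values, path and URL below are placeholders:

```sh
# vmagent spools unsent data to disk and replays it once the remote
# storage is reachable again; the disk buffer is capped per remote URL.
/path/to/vmagent \
  -remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write \
  -remoteWrite.tmpDataPath=/var/lib/vmagent-buffer \
  -remoteWrite.maxDiskUsagePerURL=10GB
```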

> For example, if 4 of 5 vmstorage nodes become unavailable, all traffic is rerouted to the single remaining server. It will not withstand such a heavy load, and both writing and reading will slow down.

You mean having an option like rerouteMaxUnavailableNodeTolerance=3, so that re-routing is disabled when 4 out of 5 vmstorage nodes are down? I don't see how this option helps with cluster availability, though:

  1. if re-routing is enabled, the only available vmstorage node likely can't handle the load and will crash;
  2. if re-routing is disabled, the query results are very likely partial (a single node contains only 20% of the series with -replicationFactor=1) and unreliable.

So in both cases read and write requests fail, and the vmstorage nodes must be fixed before the cluster can serve traffic.

Haleygo added the question label on Jun 5, 2024
Haleygo self-assigned this on Jun 5, 2024

Sinketsu commented Jun 5, 2024

Hello!

> > Disable rerouting on unavailability, but in this case we suffer when servers are updated sequentially (during rolling upgrades)
>
> Why would you suffer here? A new vmstorage node should take only a few minutes to become ready during a rolling upgrade, and a remote-write client like vmagent or Prometheus should be able to buffer unsuccessful write requests and resend them when vmstorage is back.

If the upgrade goes through without problems, then yes, the buffer on the agents saves us. But this requires a fairly large buffer on the agents, which is difficult for us to provide. There may also be situations when a server goes down for maintenance for longer, for example several hours. In that case we would like not to lose data, since the remaining servers can absorb the load.

> > For example, if 4 of 5 vmstorage nodes become unavailable, all traffic is rerouted to the single remaining server. It will not withstand such a heavy load, and both writing and reading will slow down.
>
> You mean having an option like rerouteMaxUnavailableNodeTolerance=3, so that re-routing is disabled when 4 out of 5 vmstorage nodes are down? I don't see how this option helps with cluster availability, though:
>
> 1. if re-routing is enabled, the only available vmstorage node likely can't handle the load and will crash;
> 2. if re-routing is disabled, the query results are very likely partial (a single node contains only 20% of the series with -replicationFactor=1) and unreliable.
>
> So in both cases read and write requests fail, and the vmstorage nodes must be fixed before the cluster can serve traffic.

In the second case the data will be marked as partial, but that is better than no response at all due to congestion. In this case we can retry the request against another AZ, or merge data from another AZ (depending on the chosen vmselect operation scheme).
More importantly, the servers themselves will not suffer from the large write flow. Right now, a server in such a situation may become unavailable due to high CPU utilization.


Haleygo commented Jun 5, 2024

> ...but that is better than no response at all due to congestion.

I don't think a wrong result is better than no response, and an anomaly is noticed more quickly when there is no response.

> In this case we can retry the request against another AZ, or merge data from another AZ (depending on the chosen vmselect operation scheme).

If there is another AZ, then in this case you should switch to the second AZ directly, no matter whether the vmstorage nodes in the first AZ are partially down (partial response) or totally down (no response); otherwise you get wrong results for users and for rule evaluation.

And it's unclear how to set rerouteMaxUnavailableNodeTolerance for a big cluster: how do you estimate that N nodes down is OK, but N+1 nodes down is unacceptable?


Sinketsu commented Jun 5, 2024

> If there is another AZ, then in this case you should switch to the second AZ directly, no matter whether the vmstorage nodes in the first AZ are partially down (partial response) or totally down (no response); otherwise you get wrong results for users and for rule evaluation.

Yes, we can switch to another AZ, but we would like this to happen automatically. We are currently using a single vmselect cluster over multiple AZs, since each AZ may be unavailable for some time (longer than the buffer can hold). And in such a scheme we will wait a very long time for a response from the problematic AZ due to server overload.
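
For clarity, a sketch of our current scheme under these assumptions (placeholder host names; 8401 is the standard vmstorage port for vmselect connections):

```sh
# a single shared vmselect layer reads vmstorage nodes from both AZs;
# when AZ1 is overloaded rather than fully down, every query waits on it
/path/to/vmselect \
  -storageNode=az1-vmstorage-1:8401,az1-vmstorage-2:8401,az2-vmstorage-1:8401,az2-vmstorage-2:8401
```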

> And it's unclear how to set rerouteMaxUnavailableNodeTolerance for a big cluster: how do you estimate that N nodes down is OK, but N+1 nodes down is unacceptable?

It seems this can be determined empirically by the system administrators who maintain the cluster.


Haleygo commented Jun 5, 2024

> Yes, we can switch to another AZ, but we would like this to happen automatically. We are currently using a single vmselect cluster over multiple AZs, since each AZ may be unavailable for some time (longer than the buffer can hold).

I would recommend using a separate vmselect layer for each AZ, with vmauth as a proxy in front of the vmselects. The topology looks like this:

[topology diagram: vmauth in front of a dedicated vmselect layer per AZ]

vmselect should be configured with -search.denyPartialResponse=true; vmauth uses the first_available policy and will auto-switch to the second AZ when queries against AZ1 start failing. Some pros of this topology (a minimal vmauth config sketch follows the list):

  1. less pressure on vmselect, since each instance reads only its own vmcluster, 50% of the data compared with connecting to both vmclusters;
  2. less cross-AZ network traffic, since you can always set the "local" vmcluster as the first available server.
     See similar usage in https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-distributed.
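
A minimal vmauth config sketch for this topology; the addresses are placeholders, while first_available and retry_status_codes are existing vmauth options:

```yaml
# vmauth sends queries to AZ1's vmselect first and fails over to AZ2
# when AZ1 returns errors, e.g. queries failed under
# -search.denyPartialResponse=true because vmstorage nodes are down.
unauthorized_user:
  url_prefix:
    - http://vmselect-az1:8481/select/0/prometheus/
    - http://vmselect-az2:8481/select/0/prometheus/
  load_balancing_policy: first_available
  retry_status_codes: [500, 502, 503]
```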

> It seems this can be determined empirically by the system administrators who maintain the cluster.

I don't think that's easy to do, and it's hard to provide an actionable recommendation for users.


Sinketsu commented Jun 5, 2024

> I would recommend using a separate vmselect layer for each AZ, with vmauth as a proxy in front of the vmselects.

We can't.
We may have one AZ unavailable for a long time, and during this period there will be no metrics at all in this AZ. vmauth will not be able to detect such a problem, so it will keep sending requests to this zone, which will lead to incorrect dashboards (data gaps) and incorrect alerts.


Haleygo commented Jun 6, 2024

> We may have one AZ unavailable for a long time, and during this period there will be no metrics at all in this AZ.

vmselect with -search.denyPartialResponse=true will fail query requests when more than replicationFactor-1 vmstorage nodes are unavailable, so vmauth will mark this AZ as broken and use the other AZ.
If the storage nodes in AZ1 are all fixed but the old data hasn't been backfilled yet, remove the AZ1 vmselect address from the vmauth config until the data is fixed; this is pretty handy since the vmauth config can be hot-reloaded.
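
For example, a sketch of that manual step (assuming vmauth was started with -auth.config=/etc/vmauth/config.yml; vmauth re-reads its config on SIGHUP):

```sh
# remove the AZ1 vmselect entry from url_prefix while AZ1 is backfilling
$EDITOR /etc/vmauth/config.yml
# tell the running vmauth to re-read -auth.config without a restart
kill -HUP "$(pidof vmauth)"
```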


Sinketsu commented Jun 6, 2024

> > We may have one AZ unavailable for a long time, and during this period there will be no metrics at all in this AZ.
>
> If the storage nodes in AZ1 are all fixed but the old data hasn't been backfilled yet, remove the AZ1 vmselect address from the vmauth config until the data is fixed; this is pretty handy since the vmauth config can be hot-reloaded.

This requires constant manual manipulation of data and configs, and I would like the system to respond to this automatically.
