A simple algorithm for fully utilizing bandwidth for scaling in #4137

Open
BusyJay opened this issue Sep 17, 2021 · 2 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@BusyJay
Member

BusyJay commented Sep 17, 2021

Feature Request

Describe your feature request related problem

The following is the IO utilization from an experiment in which snapshot generation is optimized to take a short, constant time and zero bandwidth.
[Figure: IO utilization of the nodes during the experiment]

The bandwidth used for snapshot replication is set to 50 MiB/s. Sending a snapshot is counted with a factor of 1/2, as the data may already be in cache, while receiving a snapshot is counted with a factor of 1, as the data always has to be written to disk. There are four nodes, and one is being deleted. An optimal scheduling algorithm should utilize all the bandwidth, that is, 50 * 4 = 200 MiB/s, so receiving snapshots should account for about 200 * 2 / 3 ≈ 133 MiB/s. But the graph above shows that receiving snapshots only fully utilizes the bandwidth some of the time.
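
To spell out the arithmetic, here is a minimal sketch in Go (the function and all names are hypothetical; only the 50 MiB/s limit, the four nodes, and the 1/2 and 1 factors come from the setup above):

```go
package main

import "fmt"

// expectedReceiveRate returns the aggregate snapshot-receive rate when the
// whole cluster budget is saturated: every replicated byte costs sendFactor
// on the sender plus recvFactor on the receiver, so the achievable receive
// rate is the total budget divided by that combined cost.
func expectedReceiveRate(perNodeLimit float64, nodes int, sendFactor, recvFactor float64) float64 {
	totalBudget := perNodeLimit * float64(nodes)
	return totalBudget / (sendFactor + recvFactor)
}

func main() {
	// 50 MiB/s per node, 4 nodes, send factor 1/2, receive factor 1.
	fmt.Printf("%.0f MiB/s\n", expectedReceiveRate(50, 4, 0.5, 1)) // prints 133 MiB/s
}
```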

Describe the feature you'd like

A simple algorithm to fully utilize bandwidth is to make every node both send and receive snapshots at the same time. Just making one node keep sending multiple snapshots without receiving may not be optimal, as they may all be sent to the same node, so the bandwidth utilization can only be 50 / 2 = 25 MiB/s.

Based on the advice from #4099, we may not be able to let an offline node send or receive snapshots, as all of its leaders are evicted. But thanks to tikv/raft-rs#135, we can still make offline nodes keep sending snapshots. So if #4099 is implemented, we can simply make offline stores send snapshots to all up nodes, and every up node should keep sending and receiving snapshots at the same time. I have implemented the idea as a simple script, and I can see that the bandwidth stays constantly at the limit until the offline store is about to become a tombstone.

Note that we don't need a high store limit to fully utilize the bandwidth. In my script, the store limit for every up node is 2.
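
As a rough illustration of that rule (not PD's actual operator API; the ring arrangement and all names are assumptions on my side), one round of scheduling could look like this:

```go
package main

import "fmt"

// Transfer is a hypothetical snapshot transfer from one store to another.
type Transfer struct {
	From, To uint64
}

// scheduleRound sketches one round of the pairing rule: the offline store
// sends to every up node, and the up nodes are arranged in a ring so that
// each of them is sending to one neighbor while receiving from another.
func scheduleRound(offline uint64, up []uint64) []Transfer {
	var ops []Transfer
	// The offline store keeps pushing its replicas out to every up node.
	for _, s := range up {
		ops = append(ops, Transfer{From: offline, To: s})
	}
	// Up nodes form a ring, so every up node is sending and receiving at once.
	for i, s := range up {
		ops = append(ops, Transfer{From: s, To: up[(i+1)%len(up)]})
	}
	return ops
}

func main() {
	for _, op := range scheduleRound(4, []uint64{1, 2, 3}) {
		fmt.Printf("store %d -> store %d\n", op.From, op.To)
	}
}
```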

A more complicated way to fully utilize bandwidth is to set up a model that predicts the bandwidth of each node in real time.

Describe alternatives you've considered

None

Teachability, Documentation, Adoption, Migration Strategy

None

@BusyJay added the type/enhancement label Sep 17, 2021
@nolouch
Contributor

nolouch commented Sep 18, 2021

For #4099, #3730 already made some improvements. I didn't consider evicting all leaders of an offline store, because I think offline stores can still provide some computing resources to take on some tasks, such as generating snapshots.
Evicting all leaders when a store goes offline, as in #4099, is just one case where we observed a latency increase; there are also cases where the latency does not change. The trade-off here may be different for each user, so it may not be a good general rule. Moreover, migrating leaders to other nodes may also be bad: the compaction caused by the add-peer tasks on the destination nodes may also affect the service. In general, what we lack is the ability to recognize, perceive, and control the system; it's better to apply these rules based on some feedback information.

For this issue: in the current PD control model, the store limit is also the knob that controls IO tasks. If the store limit is large (it needs to be considered relative to the region size), only the one store with the highest or lowest score may get the chance to migrate regions, which makes IO unbalanced. It's better to turn the store limit down in large-region scenarios. But this relies on a relatively random leader distribution, so it works better when regions are small enough. If the correlation between IO and the number of tasks is not significant, other methods may need to be considered.

@BusyJay
Member Author

BusyJay commented Sep 18, 2021

Evicting all leaders when a store goes offline, as in #4099, is just one case where we observed a latency increase; there are also cases where the latency does not change.

I have tested it repeatedly these days with different workloads. I can always see the jitter brought by deleting stores without evicting leaders. In what case is latency not affected?

The trade-off here may be different for each user, so it may not be a good general rule.

The advice of evicting leaders has been widely adopted in production deployments. In fact, it's part of the practice guide for scaling in a node on TiDB Cloud.

the compaction caused by the add-peer tasks on the destination nodes may also affect the service.

In my tests, none of the other nodes has as many compactions as the store that is deleting regions. There are two reasons:

  1. PD sets a very large store limit for the offline store, which triggers more range deletions there than on any other store.
  2. Deleting regions is the source of compactions; adding new files usually has little impact, as they are ingested into higher levels.

In general, what we lack is the ability to recognize, perceive, and control the system; it's better to apply these rules based on some feedback information.

Agreed. But that is just complicated compared to "evicting leaders before deleting a store".

If the store limit is large (it needs to be considered relative to the region size),

I'm not sure whether the store limit is high; I use the default setting, which is 15. I think the problem here is that scheduling should be based on bandwidth instead of the number of concurrent tasks. It's clearly wrong when some disks still have available bandwidth while others are busy.
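
To make the contrast concrete, here is a minimal sketch of the two kinds of admission checks (both functions and the feedback source are hypothetical, not PD's actual code):

```go
package sched

// canScheduleByTaskCount mirrors the current store-limit style of control:
// admit another snapshot task as long as fewer than storeLimit tasks are in
// flight, regardless of how much bandwidth they actually consume.
func canScheduleByTaskCount(inflightTasks, storeLimit int) bool {
	return inflightTasks < storeLimit
}

// canScheduleByBandwidth is the bandwidth-based alternative argued for above:
// admit another snapshot task only while the store still has unused snapshot
// bandwidth. usedMiBps would have to come from real-time IO feedback, which
// is the hard part.
func canScheduleByBandwidth(usedMiBps, limitMiBps float64) bool {
	return usedMiBps < limitMiBps
}
```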
