A simple algorithm for fully utilizing bandwidth for scaling in #4137
Comments
For #4099, #3730 already makes some improvements. I didn't consider evicting all leaders of the offline store, because offline stores can still provide some computing resources for tasks such as generating snapshots. For this issue, in the current PD control model, the …
I have repeatedly tested this over the past few days with different workloads. I can always see the jitter brought by deleting stores without evicting leaders. In what case would latency not be affected?
The advice of evicting leaders has been widely adopted in production deployments. In fact, it's one of the recommended practices for scaling in a node in TiDB Cloud.
In my tests, no other node has as many compactions as the node deleting regions. There are two reasons:
Agreed. But that is complicated compared to simply evicting leaders before deleting a store.
I'm not sure whether the store limit is high; I use the default setting, which is 15. I think the problem here is that scheduling should be based on bandwidth instead of the number of concurrent tasks. It's clearly wrong when some disks still have available bandwidth while others are busy (see the sketch below).
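To illustrate that point, here is a minimal hypothetical sketch (not PD's actual API or configuration) of admitting tasks against a bandwidth budget rather than a concurrent-task count:

```python
# Hypothetical illustration only; names and numbers are not PD's API.
# A count-based limit treats 15 tiny tasks the same as 15 tasks that
# saturate the disk; a bandwidth budget does not.

STORE_BANDWIDTH_BUDGET = 50.0  # MiB/s per store (assumed limit)

def can_admit(running_rates, new_task_rate):
    """Admit a new snapshot task only if the store's remaining
    bandwidth budget can cover its estimated rate."""
    return sum(running_rates) + new_task_rate <= STORE_BANDWIDTH_BUDGET

# A store already running three 10 MiB/s transfers still has headroom:
print(can_admit([10.0, 10.0, 10.0], 15.0))  # True
# ...while a count-based limit of, say, 3 would already refuse it.
```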
Feature Request
Describe your feature request related problem
The following is the IO utilization in an experiment where snapshot generation is optimized to take a short, constant time and zero bandwidth.
[IO utilization graph]
The bandwidth used for snapshot replication is limited to 50 MiB/s per node. Sending a snapshot is counted with factor 1/2, as the data may already be in cache; receiving a snapshot is counted with factor 1, as the data always has to be written to disk. There are four nodes, and one is being deleted. An optimal scheduling algorithm should utilize all bandwidth, i.e. 50 * 4 = 200 MiB/s. Since every byte received must also be sent by some node, the weighted usage at receive rate R is R + R/2 = 1.5R, so receiving snapshots should run at about 200 * 2 / 3 ≈ 133 MiB/s. But the graph above shows that receiving snapshots only fully utilizes the bandwidth some of the time.
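As a quick sanity check of the arithmetic (using the 1/2 send and 1 receive weights above):

```python
# Back-of-the-envelope check of the numbers above. Every received byte
# is also sent by some node, so total send rate equals receive rate R,
# and the weighted usage is R * 1 + R * 1/2 = 1.5 * R.
per_node_limit = 50                       # MiB/s per node
total_budget = per_node_limit * 4         # 4 nodes -> 200 MiB/s
optimal_receive = total_budget * 2 / 3    # solve 1.5 * R = 200
print(total_budget, round(optimal_receive))  # 200 133
```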
Describe the feature you'd like
A simple algorithm to fully utilize bandwidth is to make every node both send and receive a snapshot at the same time. Just making one node keep sending multiple snapshots without receiving may not be optimal, as they may all be sent to the same node, so the bandwidth utilization can be as low as 50 / 2 = 25 MiB/s.
Based on the advice in #4099, we may not be able to let the offline node send or receive snapshots, as all its leaders are evicted. But thanks to tikv/raft-rs#135, we can still make offline nodes keep sending snapshots. So if #4099 is implemented, we can simply make offline stores send snapshots to all up nodes, and every up node should keep sending and receiving snapshots at the same time (see the sketch below). I have implemented the idea as a simple script, and I can see that the bandwidth stays constantly at the limit until the offline store is about to become tombstone.
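A minimal sketch of that pairing, with hypothetical store names and ignoring region placement constraints (this is the shape of the idea, not the actual script):

```python
# Hypothetical sketch, not the actual script. The offline store only
# sends; each up node gets one outgoing and one incoming transfer so
# sending and receiving overlap instead of serializing.

def schedule_round(offline_store, up_stores):
    """Return (sender, receiver) pairs for one scheduling round."""
    # The offline store fans its regions out across all up nodes.
    pairs = [(offline_store, target) for target in up_stores]
    # Up nodes form a ring so each one sends to a distinct peer.
    n = len(up_stores)
    pairs += [(up_stores[i], up_stores[(i + 1) % n]) for i in range(n)]
    return pairs

print(schedule_round("store-4", ["store-1", "store-2", "store-3"]))
# [('store-4', 'store-1'), ('store-4', 'store-2'), ('store-4', 'store-3'),
#  ('store-1', 'store-2'), ('store-2', 'store-3'), ('store-3', 'store-1')]
```

In each such round, every up node sends one snapshot and receives at most two, which is consistent with the low store limit noted next.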
Note that we don't need a high store limit to fully utilize bandwidth. In my scripts, the store limit for every up node is 2.
A more complicated way to fully utilize bandwidth is to set up a model that predicts the bandwidth at each node in real time.
Describe alternatives you've considered
None
Teachability, Documentation, Adoption, Migration Strategy
None