
*: add resource group for the read path #14001

Merged: 16 commits merged into tikv:master on Jan 13, 2023

Conversation

@glorv (Contributor) commented Dec 28, 2022

Signed-off-by: glorv <glorvs@163.com>

What is changed and how it works?

Issue Number: Ref #13730

What's Changed:

This PR introduces changes to support resource control and priority-based task scheduling. It only covers task scheduling for the unified read pool; a similar change for the write path will come in a separate PR. The logic for keeping the resource group metadata up to date will also be in a separate PR.

Resource control is based on resource groups. A resource group maintains the resource quota (CPU time, IO, ...) for a specific collection of requests. These requests are marked with the same `ResourceGroupName` in the request context. Typically, a resource group represents one user or tenant.

### Resource Control
This PR introduces a new component called `resource_control`, which maintains the metadata and running state of each resource group. TiKV fetches the resource group metadata from PD and uses it to drive task scheduling.

### Config
We introduce a new config submodule `resource_control`; currently it contains a single config item that indicates whether the resource control feature is enabled. This config does not support online change yet.

### Task Scheduling
As tasks with different resource group tags share the same unified read pool, the new scheduler orders tasks by virtual time: tasks from a resource group with a smaller virtual time are scheduled before those from a group with a bigger one. After each task finishes, we calculate the resources it consumed (as an integer value) and increase the group's virtual time by that amount multiplied by a per-group factor. The factor is the inverse of the resource group's quota config, so groups with larger quotas accumulate virtual time more slowly. In this way, tasks from all resource groups are scheduled fairly, in proportion to their quotas.
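To make this concrete, here is a minimal sketch of the virtual-time bookkeeping (the names `GroupTracker`, `consume`, and `current_vt` are illustrative, not the PR's actual types):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One tracker per resource group. `weight` is the inverse-quota factor
// (roughly total_quota / group_quota), so groups with small quotas
// accumulate virtual time faster and are scheduled later.
struct GroupTracker {
    weight: u64,
    virtual_time: AtomicU64,
}

impl GroupTracker {
    // Called when a task finishes: convert the consumed resource (CPU time
    // in the read path) into integer units and scale by the group's weight.
    fn consume(&self, resource_units: u64) {
        self.virtual_time
            .fetch_add(resource_units * self.weight, Ordering::Relaxed);
    }

    // The pool schedules the pending task whose group currently has the
    // smallest virtual time.
    fn current_vt(&self) -> u64 {
        self.virtual_time.load(Ordering::Relaxed)
    }
}
```

A group that has consumed less than its quota share keeps a smaller virtual time and therefore naturally moves to the front of the queue.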

Currently, in the read path we only track the CPU time of read tasks, but we may take other resources into account later, such as IO read bytes.
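A common way to capture per-task CPU time in an async pool is to wrap each task's future and charge the duration of every poll to the task's group; the PR does this with `ControlledFuture` (seen in the review below). A rough, self-contained sketch of the idea, with hypothetical types:

```rust
use std::{
    future::Future,
    pin::Pin,
    sync::{
        atomic::{AtomicU64, Ordering},
        Arc,
    },
    task::{Context, Poll},
    time::Instant,
};

struct CpuTrackedFuture<F> {
    inner: Pin<Box<F>>,
    group_vt: Arc<AtomicU64>, // the group's virtual-time counter
    weight: u64,              // the group's inverse-quota factor
}

impl<F: Future> Future for CpuTrackedFuture<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // A well-behaved future never blocks inside poll, so the wall time of
        // a single poll is a reasonable approximation of the CPU time used.
        let start = Instant::now();
        let res = self.inner.as_mut().poll(cx);
        let elapsed_us = start.elapsed().as_micros() as u64;
        self.group_vt
            .fetch_add(elapsed_us * self.weight, Ordering::Relaxed);
        res
    }
}
```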

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression
    • Consumes more CPU
    • Consumes more MEM
  • Breaking backward compatibility

Release note

Support priority-based scheduling for the read path

Signed-off-by: glorv <glorvs@163.com>
@ti-chi-bot (Member) commented Dec 28, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Connor1996
  • nolouch

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@glorv (Contributor Author) commented Dec 28, 2022

@BornChanger @HuSharp PTAL

@glorv (Contributor Author) commented Dec 28, 2022

Depends on tikv/yatp#72

Signed-off-by: glorv <glorvs@163.com>
@HuSharp (Member) left a comment:

Do we still need to keep the HTTP interface? If we do, should we put it in the PR that provides the watcher mechanism?
BTW, @nolouch will change resource_manager.proto; some changes may be needed after that.

Comment on lines 51 to 63
```rust
fn get_ru_setting(setting: &GroupSettings, is_read: bool) -> f64 {
    if setting.get_mode() == GroupMode::RuMode {
        if is_read {
            setting.get_r_u_settings().get_r_r_u().get_tokens()
        } else {
            setting.get_r_u_settings().get_w_r_u().get_tokens()
        }
    } else if is_read {
        setting.get_resource_settings().get_cpu().get_tokens()
    } else {
        setting.get_resource_settings().get_io_write().get_tokens()
    }
}
```
Member:

How about `match (setting.get_mode(), is_read)`?

Contributor:

+1

```rust
let group1 = resouce_ctl.resource_group("test".as_bytes());
assert_eq!(group1.weight, 500);
let group2 = resouce_ctl.resource_group("test2".as_bytes());
assert_eq!(group1.weight, 250);
```
Member:

seems like group2

Contributor Author:

fixed

Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
@glorv (Contributor Author) commented Dec 29, 2022

/test

```rust
}

pub fn get_resource_group(&self, name: &str) -> Option<Ref<'_, String, GroupSettings>> {
    self.resource_groups.get(&name.to_ascii_lowercase())
```
Member:

How about restricting or converting on the TiDB side? It's wasteful to convert to lowercase every time.

Contributor:

TiDB converts the name to lowercase before putting it into the resource manager (PD). That means the group name is case-insensitive.

Contributor Author:

This method should only be used to update or get the resource group meta, so it is not called very often. None of the methods under `ResourceController` do this conversion, because they are on the hot path.

```rust
fn gen_group_priority_factor(&self, setting: &GroupSettings, is_read: bool) -> u64 {
    let ru_settings = Self::get_ru_setting(setting, is_read);
    // TODO: ensure the result is a valid positive integer
    (self.total_ru_quota / ru_settings * 10.0) as u64
```
Contributor:

Suggested change:

```diff
- (self.total_ru_quota / ru_settings * 10.0) as u64
+ u64::max((self.total_ru_quota / ru_settings * 10.0) as u64, 1_u64)
```

Signed-off-by: glorv <glorvs@163.com>
```diff
@@ -750,6 +752,11 @@ impl<E: Engine, L: LockManager, F: KvFormat> Storage<E, L, F> {
     const CMD: CommandKind = CommandKind::batch_get_command;
     // all requests in a batch have the same region, epoch, term, replica_read
```
Contributor:

Are we sure requests in a batch have the same resource group?

Contributor Author:

This depends on the caller/use case, but in TiDB I think it holds in most cases. It is unusual for two resource groups to access the same table region at the same time from the same TiDB server. In theory this can happen, but I don't see a concrete use case for it.

```diff
@@ -1657,6 +1673,11 @@ impl<E: Engine, L: LockManager, F: KvFormat> Storage<E, L, F> {
     const CMD: CommandKind = CommandKind::raw_batch_get_command;
     // all requests in a batch have the same region, epoch, term, replica_read
```
Contributor:

ditto.

```rust
#[serde(rename_all = "kebab-case")]
pub struct Config {
    #[online_config(skip)]
    pub enabled: bool,
```
Contributor:

Please also add a sample in config-template.toml.

Contributor Author:

@Connor1996 PTAL, do you feel OK with this new config?

Member:

looks good

Member:

Bad name. Should be `enable`.

```rust
(GroupMode::RuMode, false) => setting.get_r_u_settings().get_w_r_u().get_tokens(),
// TODO: currently we only consider the cpu usage in the read path, we may also take
// io read bytes into account later.
(GroupMode::NativeMode, true) => setting.get_resource_settings().get_cpu().get_tokens(),
```
Contributor:

I updated kvproto and changed NativeMode to RawMode, which comes from you. :)

Member:

Also needs to change `GroupSettings` to `ResourceGroup` :)

Signed-off-by: glorv <glorvs@163.com>
```rust
impl ResourceGroupManager {
    fn get_ru_setting(rg: &ResourceGroup, is_read: bool) -> f64 {
        match (rg.get_mode(), is_read) {
            (GroupMode::RuMode, true) => rg.get_r_u_settings().get_r_r_u().get_tokens(),
```
Contributor:

I think it should use `fill_rate`; the tokens field is a state of the token bucket.

```rust
fn gen_group_priority_factor(&self, rg: &ResourceGroup, is_read: bool) -> u64 {
    let ru_settings = Self::get_ru_setting(rg, is_read);
    // TODO: ensure the result is a valid positive integer
    (self.total_ru_quota / ru_settings * 10.0) as u64
```
Contributor:

Should the total RU quota be the sum of all resource group settings? That would be more reasonable at the global level.

Contributor Author (@glorv, Jan 10, 2023):

We just need the total RU quota to be a reasonably large value, so that after casting `self.total_ru_quota / ru_settings` to an integer we don't lose much precision. I haven't found a good way to set this value, so we just use a big enough default.
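To illustrate the precision point with concrete (made-up) numbers:

```rust
fn main() {
    let total_ru_quota = 10_000.0_f64; // "big enough" default
    let ru_settings = 300.0_f64;       // this group's RU quota
    // With the x10 scaling the truncating cast keeps an extra digit:
    // 10_000 / 300 = 33.33..., x10 = 333.33... -> 333 instead of 33.
    let factor = (total_ru_quota / ru_settings * 10.0) as u64;
    assert_eq!(factor, 333);
}
```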


```rust
impl GroupPriorityTracker {
    fn get_priority(&self, level: usize) -> u64 {
        // let level = match priority {
```
Contributor:

remove the code?

```rust
let mode = if is_ru_mode {
    GroupMode::RuMode
} else {
    GroupMode::NativeMode
```
Contributor:

Suggested change:

```diff
- GroupMode::NativeMode
+ GroupMode::RawMode
```

```rust
let group = GroupPriorityTracker {
    weight: priority_factor,
    virtual_time: AtomicU64::new(self.last_min_vt.load(Ordering::Acquire)),
    vt_delta_for_get,
```
Contributor:

Will the write path use this field? Otherwise there is no need for the `is_read` field. So will there be two resource controllers for one resource group?

Contributor Author:

Write and read use different `ResourceController`s (and `GroupPriorityTracker`s); only their configs are the same.


```rust
// TODO: use different thresholds for different resource types
// no need to update if the virtual time difference is less than 100ms/100KB.
if min_vt + 100_000 >= max_vt {
```
Contributor:

Why choose 100ms?

Contributor Author:

This was chosen by intuition. Since we adjust the vt of all groups every second, we leave a group's value unchanged if it is near the biggest value, so we keep the small scheduling advantage between groups. This makes the tail latency a bit more stable.

```rust
let mut extras1 = Extras::single_level();
extras1.set_metadata("test".as_bytes().to_owned());
assert_eq!(resouce_ctl.priority_of(&extras1), 25_000);
assert_eq!(group1.current_vt(), 25_000);
```
Contributor:

Does that mean the vt increased by 25ms? Won't the weight be too large?

Contributor Author:

We don't care much about the vt's absolute scale; we only care about the proportion between different groups.


```rust
fn calculate_factor(max_quota: u64, quota: u64) -> u64 {
    if quota > 0 {
        (max_quota as f64 * 10.0 / quota as f64).round() as u64
```
Member:

why

Contributor Author:

added a comment
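To make the x10 rounding concrete, a small self-contained check (the quota values and the zero-quota fallback below are assumptions, not the PR's code):

```rust
fn calculate_factor(max_quota: u64, quota: u64) -> u64 {
    if quota > 0 {
        // Scale by 10 so one extra digit of the quota ratio survives rounding.
        (max_quota as f64 * 10.0 / quota as f64).round() as u64
    } else {
        1 // fallback for groups without a quota (assumed)
    }
}

fn main() {
    assert_eq!(calculate_factor(1000, 500), 20); // half the max quota
    assert_eq!(calculate_factor(1000, 250), 40); // quarter quota, double factor
}
```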

@Connor1996 (Member) left a comment:

rest LGTM

Signed-off-by: glorv <glorvs@163.com>
@glorv (Contributor Author) commented Jan 12, 2023

@sticnarf Could you please take a look?

```rust
extras.set_metadata(group_meta.clone());
let task_cell = if let Some(resource_ctl) = resource_ctl {
    TaskCell::new(
        TrackedFuture::new(ControlledFuture::new(
```
Contributor:

What do you think about integrating the resource controller into the Runner? Then we wouldn't need another group_meta clone for each future.

Contributor Author:

I haven't found a good way to do so. Typically, when scheduling a task, the system needs to generate a priority value for it and then update some internal state based on the resources the task consumed. It's hard to do the second step in the Runner. In this implementation, since we only track CPU time, it would be possible to support it in the Runner; but if we want to track more resources in the future, e.g. read bytes, that seems hard to do in the Runner.

Contributor:

> In this implementation, since we only track CPU time, it would be possible to support it in the Runner; but if we want to track more resources in the future, e.g. read bytes, that seems hard to do in the Runner.

Makes sense.

@nolouch (Contributor) left a comment:

lgtm

ti-chi-bot added the status/LGT1 (Status: PR - There is already 1 approval) label on Jan 12, 2023
```rust
}

fn add_resource_group(&self, name: Vec<u8>, ru_quota: u64) {
    let mut max_ru_quota = self.max_ru_quota.lock().unwrap();
```
Member:

Why not use an atomic?

Contributor Author:

Because when triggering `adjust_all_resource_group_factors` we need to ensure the max RU quota does not change. Since we don't expect resource group CRUD to happen very often, a mutex keeps the logic simple.
With the mutex, if two new resource groups call add at the same time and each triggers adjusting all the factors, we can ensure the later (larger) value takes effect for all groups.

Contributor Author:

Added a comment to explain this mutex.
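A sketch of that locking rationale (structure and names are illustrative, not the PR's actual code): holding the mutex across both the max-quota update and the global factor adjustment makes the pair atomic, so concurrent calls cannot leave some groups scaled against a stale maximum.

```rust
use std::sync::Mutex;

struct ResourceGroupManager {
    // Guards the "update max quota + adjust all factors" critical section;
    // resource group CRUD is rare, so a mutex is simpler than atomics here.
    max_ru_quota: Mutex<u64>,
}

impl ResourceGroupManager {
    fn add_resource_group(&self, ru_quota: u64) {
        let mut max = self.max_ru_quota.lock().unwrap();
        if ru_quota > *max {
            *max = ru_quota;
            // Recompute every group's factor from the new maximum while the
            // lock is still held, so a concurrent add cannot interleave.
            self.adjust_all_resource_group_factors(*max);
        }
        // ... register the new group itself ...
    }

    fn adjust_all_resource_group_factors(&self, _max_quota: u64) {
        // Walk all groups and recompute weight = calculate_factor(max, quota).
    }
}
```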

Signed-off-by: glorv <glorvs@163.com>
@HuSharp (Member) commented Jan 12, 2023

/test

@Connor1996 (Member) left a comment:

LGTM

ti-chi-bot added the status/LGT2 (Status: PR - There are already 2 approvals) label and removed the status/LGT1 label on Jan 12, 2023
@HuSharp (Member) commented Jan 12, 2023

/test

@glorv (Contributor Author) commented Jan 12, 2023

/test

@Connor1996 (Member):

/merge

@ti-chi-bot (Member):

@Connor1996: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.


@ti-chi-bot (Member):

This pull request has been accepted and is ready to merge.

Commit hash: f866d91

ti-chi-bot added the status/can-merge (Status: Can merge to base branch) label on Jan 13, 2023
ti-chi-bot merged commit 2daa168 into tikv:master on Jan 13, 2023
ti-chi-bot added this to the Pool milestone on Jan 13, 2023
glorv deleted the resource_group branch on April 25, 2023
Labels: release-note, size/XXL, status/can-merge (Status: Can merge to base branch), status/LGT2 (Status: PR - There are already 2 approvals)