
feat(meta): split by table according write throughput #15547

Merged
merged 38 commits into main from wallace/split-default-group on May 27, 2024

Conversation

Little-Wallace
Contributor

@Little-Wallace Little-Wallace commented Mar 8, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

about: #15973

When a large MV is being created on a base table (or another MV), its data may pile up in L0 and cannot be compacted down to the base level in time. #13075 proposed a solution: check the average write throughput and partition tables with large throughput and size.

But there are some problems that prevent that PR from working in some cases:

  1. When we create an MV on an existing table, the barrier latency is high and it takes a long time to collect a long enough throughput window. By the time it decides to partition a table, a large amount of data may already be waiting in level 0 to be compacted.
  2. feat(storage): optimize data alignment for default compaction group #13075 also cuts a table into a separate SST file if it has enough data in the bottom level, while the same table's data in level 0 may be small. So that solution generates a lot of small files, each holding data of only one table.
  3. When the throughput decreases, that solution stops partitioning the table, which may slow compaction down.

We have discussed this problem offline, and we all agree that we must split the tables that belong to a creating MV into an independent compaction group. But that also means we shall merge back those groups that no longer write much data once the MV is created successfully. Until group merge is implemented, this PR can increase compaction speed for the default group. And although we could split every state table with large write throughput into its own group, it is still better to partition them in advance before the split, because a group split only affects data flushed after the split. See the sketch below for the partition decision.
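The rough idea, as a minimal sketch (illustrative only: the `HashMap`-based throughput history, the `window` parameter and the function name are simplifications rather than the actual meta-service code; `throughput_threshold` and `hybrid_vnode_count` mirror option names quoted later in this conversation):

```rust
use std::collections::{HashMap, VecDeque};

/// Decide which tables get a vnode-partition hint based on recent write
/// throughput, before any compaction-group split happens. Sketch only.
fn pick_partition_hints(
    history: &HashMap<u32, VecDeque<u64>>, // table_id -> bytes written per checkpoint
    window: usize,
    throughput_threshold: u64,
    hybrid_vnode_count: u32,
) -> HashMap<u32, u32> {
    let mut hints = HashMap::new();
    for (table_id, samples) in history {
        if samples.is_empty() {
            continue;
        }
        // Average over the most recent `window` samples so a table that just
        // started a heavy backfill is picked up quickly.
        let recent: Vec<u64> = samples.iter().rev().take(window.max(1)).copied().collect();
        let avg = recent.iter().sum::<u64>() / recent.len() as u64;
        if avg > throughput_threshold {
            // Partition the table's data by vnode inside the hybrid group, so a
            // later group split only has to move whole files.
            hints.insert(*table_id, hybrid_vnode_count);
        }
    }
    hints
}
```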

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@Little-Wallace Little-Wallace changed the title split by table according write throughput feat(meta): split by table according write throughput Mar 8, 2024
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
@Little-Wallace Little-Wallace marked this pull request as ready for review March 12, 2024 15:20
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
src/meta/src/hummock/manager/mod.rs (Outdated)
1,
params.checkpoint_frequency() * barrier_interval_ms / 1000,
);
let history_table_throughput = self.history_table_throughput.read();
Contributor

Since we have already used size to simulate throughput, I think there is no need to check history_table_throughput here?

If the size of the table in the input is small but its history_table_throughput is high, do we still need to split it? Is this expected?

Contributor Author

Yes. I think we should split it so that each table ends up in at least one file of its own.

src/meta/src/hummock/compaction/mod.rs (Outdated)
let total_size = level.total_file_size
+ handlers[upper_level].get_pending_output_file_size(level.level_idx)
- handlers[level_idx].get_pending_output_file_size(level.level_idx + 1);
let output_file_size =
Contributor

Note: this will change the priority of current-level compaction; let us discuss it offline.

Contributor

What is the purpose of this?

Collaborator

Same question here. Why do we ignore handlers[upper_level].get_pending_output_file_size(level.level_idx) here?

Li0k previously requested changes Apr 2, 2024
src/storage/src/hummock/compactor/shared_buffer_compact.rs (Outdated)
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>

gitguardian bot commented Apr 29, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 9425213 | Triggered | Generic Password | e5b4a02 | e2e_test/source/cdc/cdc.validate.postgres.slt |
| 9425213 | Triggered | Generic Password | 7509db8 | e2e_test/source/cdc/cdc.validate.postgres.slt |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.



🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
let total_size = level.total_file_size
+ handlers[upper_level].get_pending_output_file_size(level.level_idx)
- handlers[level_idx].get_pending_output_file_size(level.level_idx + 1);
let output_file_size =
Contributor

What is the purpose of this?

for sst in &input_ssts.table_infos {
    existing_table_ids.extend(sst.table_ids.iter());
    if !sst.table_ids.is_empty() {
        *table_size_info.entry(sst.table_ids[0]).or_default() +=
Contributor

Why is only table_ids[0] counted? Is this an estimation algorithm?

Contributor Author

Emmm, no. I will fix it. In fact, I split all SSTs by table precisely to make the calculation here accurate.
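For context, a hedged sketch of the accounting this reply refers to: once flushed SSTs are split per table, the per-table input size of a compaction task can be accumulated exactly. `SstInfo`, its fields and the function name are illustrative stand-ins for the real sstable metadata types:

```rust
use std::collections::{BTreeMap, HashSet};

// Simplified stand-in for the real sstable metadata.
struct SstInfo {
    table_ids: Vec<u32>,
    file_size: u64,
}

fn collect_table_sizes(input_ssts: &[SstInfo]) -> (HashSet<u32>, BTreeMap<u32, u64>) {
    let mut existing_table_ids = HashSet::new();
    let mut table_size_info: BTreeMap<u32, u64> = BTreeMap::new();
    for sst in input_ssts {
        existing_table_ids.extend(sst.table_ids.iter().copied());
        if sst.table_ids.len() == 1 {
            // Exact: the whole file belongs to this single table.
            *table_size_info.entry(sst.table_ids[0]).or_default() += sst.file_size;
        } else if !sst.table_ids.is_empty() {
            // Fallback for files that still mix tables: split the size evenly,
            // which is only an estimate.
            let share = sst.file_size / sst.table_ids.len() as u64;
            for table_id in &sst.table_ids {
                *table_size_info.entry(*table_id).or_default() += share;
            }
        }
    }
    (existing_table_ids, table_size_info)
}
```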

1,
params.checkpoint_frequency() * barrier_interval_ms / 1000,
);
let history_table_throughput = self.history_table_throughput.read();
Contributor

Please add some documentation.

src/meta/src/hummock/manager/compaction.rs (resolved)
src/meta/src/hummock/manager/compaction.rs (resolved)
self.last_table_id = user_key.table_id.table_id;
self.split_weight_by_vnode = 0;
self.largest_vnode_in_current_partition = VirtualNode::MAX.to_index();
if let Some(builder) = self.current_builder.as_ref()
Contributor

If we introduce this logic, it may produce more 4 MB SSTs when the table_id switches. That seems contrary to #16495, since it will bring more SSTs at high levels.

PTAL @hzxa21 @zwang28

Contributor Author

But it only splits by table, not by vnode partition. I think it is acceptable.

let total_size = level.total_file_size
+ handlers[upper_level].get_pending_output_file_size(level.level_idx)
- handlers[level_idx].get_pending_output_file_size(level.level_idx + 1);
let output_file_size =
Collaborator

Same question here. Why do we ignore handlers[upper_level].get_pending_output_file_size(level.level_idx) here?

Comment on lines 313 to 322
if compact_table_size > compaction_config.max_compaction_bytes / 2 {
    compact_task
        .table_vnode_partition
        .insert(table_id, default_partition_count);
Collaborator

It took me a while to understand why we assign default_partition_count for a table with size > max_compaction_bytes / 2. It is because we assign default_partition_count for a split compaction group, and we want to treat a table in the hybrid group that takes up a large share of the task the same way. This can easily confuse others. Can we add some comments?

Contributor Author

Just a magic number. I need some value to decide whether to partition for a large task.
But how do we decide whether a task is a large task?
I could use another value instead, such as default_partition_count * compaction_config.target_file_size_base.

Comment on lines 317 to 337
} else if compact_table_size > compaction_config.sub_level_max_compaction_bytes
    || (compact_table_size > compaction_config.target_file_size_base
        && write_throughput > self.env.opts.table_write_throughput_threshold)
{
    // partition for large write throughput table.
    compact_task
        .table_vnode_partition
        .insert(table_id, hybrid_vnode_count);
}
}
Collaborator

Same here. Please add some comments here. Also, why do we use sub_level_max_compaction_bytes and target_file_size_base here?

Contributor Author

OK, I will add some comments.

Comment on lines 313 to 337
if compact_table_size > compaction_config.max_compaction_bytes / 2 {
    compact_task
        .table_vnode_partition
        .insert(table_id, default_partition_count);
} else if compact_table_size > compaction_config.sub_level_max_compaction_bytes
    || (compact_table_size > compaction_config.target_file_size_base
        && write_throughput > self.env.opts.table_write_throughput_threshold)
{
    // partition for large write throughput table.
    compact_task
        .table_vnode_partition
        .insert(table_id, hybrid_vnode_count);
}
}
Collaborator

Should we make sure this strategy only affects L0 compaction tasks but not other compaction tasks? Otherwise, I think we can easily create small files in bottom levels where compact_table_size is generally large.

Contributor Author

It applies to L0 and the base level.
Not to other levels, because we clear table_vnode_partition later in the code.
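To make the intent of the quoted snippet easier to follow, here is a commented restatement as a standalone function. The threshold names come from the snippet above; the function, its signature and the comments are illustrative, not the merged code:

```rust
// Only applied while picking L0 / base-level tasks; for deeper levels the
// table_vnode_partition map is cleared afterwards, so bottom-level tasks are
// not fragmented into small per-vnode files.
fn decide_table_partition(
    compact_table_size: u64,
    write_throughput: u64,
    max_compaction_bytes: u64,
    sub_level_max_compaction_bytes: u64,
    target_file_size_base: u64,
    table_write_throughput_threshold: u64,
    default_partition_count: u32,
    hybrid_vnode_count: u32,
) -> Option<u32> {
    if compact_table_size > max_compaction_bytes / 2 {
        // The table dominates a "large" task, so give it the same partition
        // count a dedicated (split) compaction group would get.
        Some(default_partition_count)
    } else if compact_table_size > sub_level_max_compaction_bytes
        || (compact_table_size > target_file_size_base
            && write_throughput > table_write_throughput_threshold)
    {
        // The table is big enough in this task, or it is hot (high write
        // throughput) and at least one target file in size: partition it by
        // the smaller hybrid vnode count so a later group split stays cheap.
        Some(hybrid_vnode_count)
    } else {
        // Small and cold tables are left unpartitioned to avoid small files.
        None
    }
}
```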

src/storage/src/hummock/sstable/multi_builder.rs (Outdated)
@Little-Wallace
Contributor Author

Same question here. Why do we ignore handlers[upper_level].get_pending_output_file_size(level.level_idx) here?

I refactored this code before, although RocksDB only subtracts the pending input file size from the current level. But I found that it is not a good idea, because it would compact data earlier, before the pending task has really changed the shape of the LSM tree.
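A sketch of the two scoring variants being compared, assuming that `pending_output_from_upper` and `pending_output_to_lower` correspond to the two `get_pending_output_file_size` terms in the quoted code; the helper itself is illustrative:

```rust
// Estimate a level's effective size when scoring it for compaction.
// `pending_output_from_upper`: data that running upper-level tasks will soon
// write into this level; `pending_output_to_lower`: data that running tasks
// will soon move out of it.
fn effective_level_size(
    total_file_size: u64,
    pending_output_from_upper: u64,
    pending_output_to_lower: u64,
    count_incoming_data: bool,
) -> u64 {
    if count_incoming_data {
        // Variant matching the quoted snippet: also count data that pending
        // upper-level tasks will flush into this level, so the level is not
        // compacted before those tasks actually change the LSM shape.
        (total_file_size + pending_output_from_upper).saturating_sub(pending_output_to_lower)
    } else {
        // RocksDB-style alternative mentioned in the comment: only subtract
        // what pending tasks will take out of this level.
        total_file_size.saturating_sub(pending_output_to_lower)
    }
}
```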

src/storage/src/hummock/sstable/multi_builder.rs (Outdated)
src/storage/src/hummock/sstable/builder.rs (Outdated)
if let Some(table_size) = table_size_infos.get(table_id)
    && *table_size > min_sstable_size
{
    table_vnode_partition.insert(*table_id, 1);
Collaborator

Is it beneficial to change shared_buffer_compact to enable table splits? Given that tier-compaction is a must and it takes the whole overlapping level as its input, splitting CN SSTs seems unnecessary?

Contributor Author

We split to calculate the size of each state table in the SSTs more accurately.

Contributor

Why do we need to introduce a new config min_sstable_size instead of target_file_size_base or sstable_size?

Contributor Author

Because the compute node cannot know target_file_size_base.
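A minimal sketch of the compute-node-side check being discussed, with simplified map types and an illustrative function name; only `min_sstable_size` and `table_vnode_partition` come from the quoted snippet:

```rust
use std::collections::{BTreeMap, HashMap};

// During a shared-buffer flush, give a state table its own split (partition
// count 1) only when it already carries at least `min_sstable_size` bytes, so
// per-table sizes can later be accounted exactly without producing tiny files.
// `min_sstable_size` is a storage option because the compute node does not see
// the compaction config's target_file_size_base.
fn plan_flush_splits(
    table_size_infos: &HashMap<u32, u64>,
    min_sstable_size: u64,
) -> BTreeMap<u32, u32> {
    let mut table_vnode_partition = BTreeMap::new();
    for (table_id, table_size) in table_size_infos {
        if *table_size > min_sstable_size {
            // One partition: this table's data is cut into its own file, but
            // not further split by vnode.
            table_vnode_partition.insert(*table_id, 1);
        }
    }
    table_vnode_partition
}
```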

if let Some(table_size) = table_size_infos.get(table_id)
    && *table_size > min_sstable_size
{
    table_vnode_partition.insert(*table_id, 1);
Contributor

Why do we need to introduce a new config min_sstable_size instead of target_file_size_base or sstable_size?

src/storage/src/hummock/compactor/shared_buffer_compact.rs (Outdated)
src/meta/src/hummock/manager/compaction.rs (Outdated)
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
@Little-Wallace Little-Wallace dismissed Li0k’s stale review May 20, 2024 06:02

We have addressed this request elsewhere.

Collaborator

@hzxa21 hzxa21 left a comment

Rest LGTM

Comment on lines 40 to 41
| hybrid_few_partition_threshold | | 134217728 |
| hybrid_more_partition_threshold | | 536870912 |
Collaborator

I believe no one will understand what these two configs mean by looking at the names.
How about:
compact_task_table_size_split_threshold_low
compact_task_table_size_split_threshold_high

Also, let's fill in the description column here in docs.md to explain what these two configs mean.

Contributor Author

@Little-Wallace Little-Wallace May 20, 2024

How about compact_task_table_size_partition_threshold_low? Because we do not actually split these tables.

src/meta/src/hummock/manager/compaction.rs (Outdated)
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
src/storage/src/opts.rs (resolved)
src/common/src/config.rs (Outdated)
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
@Little-Wallace Little-Wallace added this pull request to the merge queue May 27, 2024
Merged via the queue into main with commit ac93e24 May 27, 2024
27 of 29 checks passed
@Little-Wallace Little-Wallace deleted the wallace/split-default-group branch May 27, 2024 15:00