feat(meta): split by table according write throughput #15547
Conversation
src/meta/src/hummock/manager/mod.rs
    1,
    params.checkpoint_frequency() * barrier_interval_ms / 1000,
);
let history_table_throughput = self.history_table_throughput.read();
Since we already use size to simulate throughput, I think there is no need to check history_table_throughput here?
If the size of the table in the input is small but history_table_throughput is high, do we need to split it? Is this expected?
Yes. I think we should split it so that, at a minimum, each table goes into its own files.
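Roughly what that check could look like, as a minimal sketch: a hypothetical per-table history of recent write sizes, keyed by table id. The field names and the all-recent-windows condition are assumptions for illustration, not the actual logic in mod.rs.

```rust
use std::collections::{HashMap, VecDeque};

/// Decide whether a table's recent write throughput is high enough to give it
/// its own partition in the compaction task. All names and the threshold
/// semantics here are illustrative.
fn is_high_throughput_table(
    history: &HashMap<u32, VecDeque<u64>>, // table_id -> bytes written per recent window
    table_id: u32,
    window_count: usize, // e.g. checkpoint_frequency * barrier_interval_ms / 1000
    throughput_threshold: u64,
) -> bool {
    history
        .get(&table_id)
        .map(|samples| {
            let recent: Vec<&u64> = samples.iter().rev().take(window_count).collect();
            // Treat the table as hot only if every recent window exceeded the
            // threshold, so a single spike does not trigger a split.
            !recent.is_empty() && recent.iter().all(|&&bytes| bytes > throughput_threshold)
        })
        .unwrap_or(false)
}
```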
let total_size = level.total_file_size
    + handlers[upper_level].get_pending_output_file_size(level.level_idx)
    - handlers[level_idx].get_pending_output_file_size(level.level_idx + 1);
let output_file_size =
Note: this will change the priority of current-level compaction; let's discuss it offline.
What is the purpose of this?
Same question here. Why do we ignore handlers[upper_level].get_pending_output_file_size(level.level_idx) here?
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 9425213 | Triggered | Generic Password | e5b4a02 | e2e_test/source/cdc/cdc.validate.postgres.slt |
| 9425213 | Triggered | Generic Password | 7509db8 | e2e_test/source/cdc/cdc.validate.postgres.slt |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely, following best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following best practices for managing and storing secrets, including API keys and other credentials;
- installing secret detection on pre-commit to catch secrets before they leave your machine and to ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
for sst in &input_ssts.table_infos {
    existing_table_ids.extend(sst.table_ids.iter());
    if !sst.table_ids.is_empty() {
        *table_size_info.entry(sst.table_ids[0]).or_default() +=
Why is only table_ids[0] counted? Is this an estimation algorithm?
Hmm, no. I will fix it. In fact, I split all SSTs by table precisely so that this calculation is accurate.
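A sketch of the more accurate accounting, under the assumption that an SST may still carry several table ids; the SstInfo struct and the even-split fallback are illustrative, not the final fix.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified SST descriptor used only for this sketch.
struct SstInfo {
    table_ids: Vec<u32>,
    file_size: u64,
}

/// Attribute SST sizes to state tables. When every SST holds data of exactly
/// one table (which splitting SSTs by table guarantees), charging the whole
/// file to `table_ids[0]` is exact; otherwise fall back to an even split
/// across the tables in the file as a rough estimate.
fn collect_table_sizes(ssts: &[SstInfo]) -> HashMap<u32, u64> {
    let mut table_size_info: HashMap<u32, u64> = HashMap::new();
    for sst in ssts {
        if sst.table_ids.is_empty() {
            continue;
        }
        let share = sst.file_size / sst.table_ids.len() as u64;
        for table_id in &sst.table_ids {
            *table_size_info.entry(*table_id).or_default() += share;
        }
    }
    table_size_info
}
```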
    1,
    params.checkpoint_frequency() * barrier_interval_ms / 1000,
);
let history_table_throughput = self.history_table_throughput.read();
Please add some documentation.
self.last_table_id = user_key.table_id.table_id;
self.split_weight_by_vnode = 0;
self.largest_vnode_in_current_partition = VirtualNode::MAX.to_index();
if let Some(builder) = self.current_builder.as_ref()
But it only splits by table, not by partitioned vnode. I think that is acceptable.
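A rough sketch of the table-boundary switch described here, with a hypothetical trimmed-down builder state; it shows only the per-table split, not vnode partitioning.

```rust
/// Hypothetical, trimmed-down builder state used only for this sketch.
struct SplitState {
    last_table_id: u32,
    split_weight_by_vnode: u32,
    largest_vnode_in_current_partition: usize,
}

impl SplitState {
    /// When the incoming key belongs to a new state table, reset the vnode
    /// partition bookkeeping so the next SST starts at a table boundary.
    /// Splitting happens per table here; finer vnode partitioning is applied
    /// separately only for tables configured with a partition count.
    fn on_table_switch(&mut self, new_table_id: u32, max_vnode_index: usize) -> bool {
        if new_table_id == self.last_table_id {
            return false;
        }
        self.last_table_id = new_table_id;
        self.split_weight_by_vnode = 0;
        self.largest_vnode_in_current_partition = max_vnode_index;
        true // caller should seal the current builder and open a new one
    }
}
```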
if compact_table_size > compaction_config.max_compaction_bytes / 2 {
    compact_task
        .table_vnode_partition
        .insert(table_id, default_partition_count);
It took me a while to understand why we assign default_partition_count for tables with size > max_compaction_bytes / 2. It is because we assign default_partition_count for the split compaction group, and we want to treat a large table in a hybrid group's task the same way. This can easily confuse others. Can we add some comments?
It is just a magic number. I need some value to decide whether to partition a large task. But how do we decide whether a task is large? I can use another value instead: default_partition_count * compaction_config.target_file_size_base.
} else if compact_table_size > compaction_config.sub_level_max_compaction_bytes
    || (compact_table_size > compaction_config.target_file_size_base
        && write_throughput > self.env.opts.table_write_throughput_threshold)
{
    // partition for large write throughput table.
    compact_task
        .table_vnode_partition
        .insert(table_id, hybrid_vnode_count);
    }
}
Same here. Please add some comments here. Also, why do we use sub_level_max_compaction_bytes and target_file_size_base here?
OK, I will add some comments.
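For context, a commented sketch of the partitioning decision discussed in this thread; the parameter names are lifted from the quoted snippets, and the alternative threshold suggested above appears only as a comment, not as the final choice.

```rust
/// Decide how many vnode partitions a table should get inside a compaction
/// task. This is a sketch of the logic discussed above, not the final code.
fn decide_table_partition(
    compact_table_size: u64,
    write_throughput: u64,
    max_compaction_bytes: u64,
    sub_level_max_compaction_bytes: u64,
    target_file_size_base: u64,
    table_write_throughput_threshold: u64,
    default_partition_count: u32,
    hybrid_vnode_count: u32,
) -> Option<u32> {
    // A table that dominates the task (comparable to what a split compaction
    // group would hold) is partitioned as aggressively as a split group.
    // The suggested alternative threshold would be
    // `default_partition_count as u64 * target_file_size_base`
    // instead of `max_compaction_bytes / 2`.
    if compact_table_size > max_compaction_bytes / 2 {
        Some(default_partition_count)
    // A table large enough to fill a sub level, or moderately large *and*
    // written at a high rate, gets a smaller hybrid partition count so its
    // data stays split without exploding the file count.
    } else if compact_table_size > sub_level_max_compaction_bytes
        || (compact_table_size > target_file_size_base
            && write_throughput > table_write_throughput_threshold)
    {
        Some(hybrid_vnode_count)
    } else {
        None
    }
}
```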
Should we make sure this strategy only affects L0 compaction tasks but not other compaction tasks? Otherwise, I think we can easily create small files in bottom levels, where compact_table_size is generally large.
It applies to L0 and the base level, not to other levels, because we clear table_vnode_partition later in the code.
I refactored this code before, although RocksDB only reduces the pending input file size from the current level. But I found that it is not a good idea, because it compacts data earlier, before the pending task has really changed the shape of the LSM tree.
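A sketch of the scoping described in this reply, with an assumed trimmed-down task type: partition hints are kept only for L0 and base-level tasks and cleared otherwise.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down view of a compaction task for this sketch.
struct CompactTaskView {
    target_level: u32,
    base_level: u32,
    table_vnode_partition: HashMap<u32, u32>,
}

/// Keep per-table vnode partitioning only for L0 and base-level tasks.
/// Deeper levels already hold large, well-formed files, so partitioning them
/// would just produce many small SSTs.
fn scope_partitioning(task: &mut CompactTaskView) {
    let is_l0_or_base = task.target_level == 0 || task.target_level == task.base_level;
    if !is_l0_or_base {
        task.table_vnode_partition.clear();
    }
}
```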
if let Some(table_size) = table_size_infos.get(table_id)
    && *table_size > min_sstable_size
{
    table_vnode_partition.insert(*table_id, 1);
Is it beneficial to change shared_buffer_compact to enable table split? Given that tier compaction is a must and it takes the whole overlapping level as input, splitting CN SSTs seems unnecessary?
We split so that the size of each state table in an SST can be calculated more accurately.
Why do we need to introduce a new config min_sstable_size instead of target_file_size_base or sstable_size?
Because the compute node cannot know target_file_size_base.
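A sketch of the compute-node-side check being discussed, with simplified types: tables whose flushed size exceeds min_sstable_size get a partition count of 1 so they are split at the table boundary. The function shape is an assumption for illustration.

```rust
use std::collections::HashMap;

/// On the compute node, mark tables whose flushed data is large enough to be
/// worth writing into their own SSTs. `min_sstable_size` is a CN-side config,
/// used here because the compute node does not know the compaction config's
/// `target_file_size_base`.
fn build_table_vnode_partition(
    table_size_infos: &HashMap<u32, u64>, // table_id -> bytes in this flush
    min_sstable_size: u64,
) -> HashMap<u32, u32> {
    let mut table_vnode_partition = HashMap::new();
    for (table_id, table_size) in table_size_infos {
        if *table_size > min_sstable_size {
            // A partition count of 1 only forces a split at the table
            // boundary; no further vnode partitioning is applied here.
            table_vnode_partition.insert(*table_id, 1u32);
        }
    }
    table_vnode_partition
}
```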
We have solved this request in another place.
Rest LGTM
src/config/docs.md
| hybrid_few_partition_threshold | | 134217728 |
| hybrid_more_partition_threshold | | 536870912 |
I believe no one will understand what these two configs mean by looking at the names.
How about:
- compact_task_table_size_split_threshold_low
- compact_task_table_size_split_threshold_high
Also, let's fill in the description column in docs.md to explain what these two configs mean.
How about compact_task_table_size_partition_threshold_low? Because we do not exactly split these tables.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
about: #15973
When a large MV is being created on a base table (or another MV), its data may be delayed in L0 and cannot be compacted to the base level in time. #13075 proposed a solution: check the average throughput and partition tables with large throughput and size.
But there are some problems that prevent that PR from working in some cases:
We have discussed this problem offline and all agree that we must split the tables belonging to a creating MV into an independent group. But that also means we shall merge groups that do not write much data after the MV is created successfully. Before group merge is implemented, this PR can increase compaction speed for the default group. And although we can split all state tables with large write throughput, it is still better to partition them in advance, because splitting a group only affects new data flushed after the split.
Checklist
- ./risedev check (or alias, ./risedev c)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.