chore: calc rows per block for recluster #17639
Conversation
Pull Request Overview
This PR refactors block threshold calculations for recluster and compaction operations, ensuring that the computed rows per block better match the actual data size, row count, and compression metrics. Key changes include:
- Refactoring of block thresholds with new functions (calc_rows_for_compact and calc_rows_for_recluster) that use revised min/max byte thresholds.
- Updates to various modules and tests to adopt the new threshold parameters and default configuration constants.
- Fixes in Hilbert recluster logic and improved propagation of updated thresholds in block compaction.
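For orientation, here is a minimal sketch of the kind of threshold struct these calculations revolve around. The field names and grouping are assumptions for illustration, not copied from the PR:

```rust
// Illustrative only: field names are assumptions, not the PR's code.
pub struct BlockThresholds {
    pub max_rows_per_block: usize,       // upper bound on rows in one block
    pub min_rows_per_block: usize,       // lower bound before a block counts as fragmented
    pub max_bytes_per_block: usize,      // in-memory (uncompressed) size threshold
    pub min_bytes_per_block: usize,      // revised lower byte bound used by the new calculations
    pub max_compressed_per_block: usize, // on-disk (compressed) size threshold
}
```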
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
File | Description
---|---
src/query/expression/tests/it/block_thresholds.rs | Added tests for verifying the new block threshold calculations.
src/query/storages/fuse/src/operations/mutation/mutator/block_compact_mutator.rs | Updated usage of SegmentCompactChecker and threshold parameters.
src/query/service/src/pipelines/processors/transforms/window/partition/data_processor_strategy.rs | Adjusted compact strategy to use new configuration settings.
src/query/service/tests/it/storages/fuse/statistics.rs | Updated test thresholds to match new defaults.
src/query/sql/src/executor/physical_plans/physical_recluster.rs | Injected rows_per_block in HilbertPartition for reclustering.
src/common/io/src/constants.rs | Modified default constants for block buffer and compression sizes.
src/query/service/src/pipelines/builders/builder_recluster.rs | Replaced calc_rows_per_block with calc_rows_for_recluster.
src/query/expression/src/utils/block_thresholds.rs | Refactored the threshold calculations and renamed functions for clarity.
src/query/catalog/src/table.rs | Leveraged default thresholds for table block settings.
src/query/service/src/pipelines/builders/builder_hilbert_partition.rs | Updated compact strategy construction with calculated max_bytes_per_block.
(Other test and interpreter files) | Adjusted to use the new BlockThresholds API and constants.
Comments suppressed due to low confidence (3)
src/query/expression/src/utils/block_thresholds.rs:149
- The condition comparing the row-based block count with the compressed-size-based block count should be re-evaluated for borderline data distributions. Consider adding targeted tests to verify that this logic yields the expected block row calculations in edge cases.
if block_num_by_rows >= block_num_by_compressed {
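A sketch of the kind of comparison this refers to, under assumed variable names rather than the PR's actual code: the block count implied by the row limit is weighed against the block count implied by the compressed-size limit, and the stricter of the two decides how finely the data is split.

```rust
// Sketch only: names and exact formula are assumptions, not the PR's implementation.
fn rows_per_block_sketch(
    total_rows: usize,
    total_compressed: usize,
    max_rows_per_block: usize,
    max_compressed_per_block: usize,
) -> usize {
    let block_num_by_rows = total_rows.div_ceil(max_rows_per_block);
    let block_num_by_compressed = total_compressed.div_ceil(max_compressed_per_block);

    // Whichever limit forces more blocks is the binding constraint; the
    // target rows per block follows from that block count.
    let block_num = if block_num_by_rows >= block_num_by_compressed {
        block_num_by_rows // row count dominates
    } else {
        block_num_by_compressed // compressed size dominates
    };
    total_rows.div_ceil(block_num.max(1))
}
```

Borderline cases are exactly where the two block counts land within one of each other, which is where targeted tests would be most valuable.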
src/query/service/src/pipelines/builders/builder_hilbert_partition.rs:77
- [nitpick] Consider extracting the calculation for max_bytes_per_block into a dedicated helper or constant to improve readability and maintain consistency across modules.
let max_bytes_per_block = std::cmp::min(4 * table.get_option(FUSE_OPT_KEY_BLOCK_IN_MEM_SIZE_THRESHOLD, DEFAULT_BLOCK_BUFFER_SIZE), 400 * 1024 * 1024);
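One possible shape for the helper this nitpick suggests; the function and constant names are illustrative, and only the 4x multiplier and the 400 MiB cap come from the line above:

```rust
// Illustrative helper: names are assumptions; the formula mirrors the call site above.
const MAX_BYTES_PER_BLOCK_CAP: usize = 400 * 1024 * 1024;

fn calc_max_bytes_per_block(block_in_mem_size_threshold: usize) -> usize {
    // Allow up to four times the in-memory block size threshold,
    // but never exceed the hard cap.
    std::cmp::min(4 * block_in_mem_size_threshold, MAX_BYTES_PER_BLOCK_CAP)
}
```

The call site would then pass `table.get_option(FUSE_OPT_KEY_BLOCK_IN_MEM_SIZE_THRESHOLD, DEFAULT_BLOCK_BUFFER_SIZE)` into the helper, keeping the policy in one place.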
src/query/expression/src/utils/block_thresholds.rs:110
- [nitpick] Ensure that the naming and documentation clearly differentiate 'calc_rows_for_compact' from 'calc_rows_for_recluster', as their purposes are similar but apply in different scenarios.
pub fn calc_rows_for_compact(&self, total_bytes: usize, total_rows: usize) -> usize {
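A sketch of doc comments that would make the distinction explicit; the wording and the recluster parameter list are assumptions (only the compact signature appears above):

```rust
impl BlockThresholds {
    /// Target rows per block when *compacting* existing blocks: derived from
    /// the total uncompressed byte size and total row count of the blocks
    /// being compacted.
    pub fn calc_rows_for_compact(&self, total_bytes: usize, total_rows: usize) -> usize {
        unimplemented!() // sketch only
    }

    /// Target rows per block when *reclustering*: data is rewritten after
    /// re-sorting, so the compressed size is also taken into account to avoid
    /// producing fragmented small blocks. (Parameter list assumed.)
    pub fn calc_rows_for_recluster(
        &self,
        total_rows: usize,
        total_bytes: usize,
        total_compressed: usize,
    ) -> usize {
        unimplemented!() // sketch only
    }
}
```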
LGTM. @youngsofun, could you please help review the changes in the following two places:
thanks
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
Dynamically calculate optimal rows per block
Introduced logic to compute the optimal number of rows per block based on total data size, row count, and compressed size. This ensures that generated blocks meet both performance and storage efficiency thresholds.
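For a rough illustration with invented numbers (not taken from the PR): if a recluster task covers 1 GB of uncompressed data, 10 million rows, and 200 MB compressed, and the thresholds allow at most 125 MB of uncompressed data per block, the size limit implies roughly 8 blocks, so the target becomes about 1.25 million rows per block; whichever of the row-count, data-size, or compressed-size limits implies the most blocks is the one that wins.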
Fix potential fragmentation in Hilbert recluster
Resolved an issue where Hilbert-based reclustering could result in fragmented small blocks, affecting downstream compaction and performance.
Enable effective compaction after modifying `block_size_thresholds`
Adjustments to `BlockThresholds` now properly propagate into the block compaction logic, ensuring compact operations behave as expected after threshold changes.
Update default configuration: `file_size` = 16MB, `block_size_thresholds` = 125MB
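Put concretely, the new defaults could look something like the sketch below; the constant names and which constant carries which value are assumptions (only DEFAULT_BLOCK_BUFFER_SIZE is referenced elsewhere in this PR), while the 16MB and 125MB figures come from the summary above:

```rust
// Illustrative only: names/mapping assumed; values taken from the PR summary.
pub const DEFAULT_BLOCK_COMPRESSED_SIZE: usize = 16 * 1024 * 1024;  // target on-disk file_size
pub const DEFAULT_BLOCK_BUFFER_SIZE: usize = 125 * 1024 * 1024;     // in-memory block_size_thresholds
```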
Tests
Type of change
This change is