
chore: calc rows per block for recluster #17639

Merged: 7 commits into databendlabs:main on Mar 27, 2025
Conversation

zhyass
Member

@zhyass zhyass commented Mar 22, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

  1. Dynamically calculate optimal rows per block
    Introduced logic to compute the optimal number of rows per block based on total data size, row count, and compressed size. This ensures that generated blocks meet both performance and storage efficiency thresholds. A rough sketch of the idea follows this list.

  2. Fix potential fragmentation in Hilbert recluster
    Resolved an issue where Hilbert-based reclustering could result in fragmented small blocks, affecting downstream compaction and performance.

  3. Enable effective compaction after modifying block_size_thresholds
    Adjustments to BlockThresholds now properly propagate into the block compaction logic, ensuring compact operations behave as expected after threshold changes.

  4. Update default configuration: file_size = 16MB, block_size_thresholds = 125MB
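
For item 1, here is a minimal sketch of how rows per block can be derived from total bytes, row count, and compressed size. The struct fields, the helper name, and the tie-breaking details are illustrative assumptions, not Databend's actual BlockThresholds API; only the `block_num_by_rows >= block_num_by_compressed` comparison is taken from the diff quoted in the review below.

```rust
// Illustrative sketch; NOT the actual implementation in
// src/query/expression/src/utils/block_thresholds.rs.
struct BlockThresholds {
    max_rows_per_block: usize,       // e.g. 1_000_000 rows
    max_bytes_per_block: usize,      // in-memory size threshold, e.g. 125 MB
    min_compressed_per_block: usize, // hypothetical: target compressed size, e.g. 16 MB
}

impl BlockThresholds {
    /// Choose rows per block from total uncompressed bytes, row count and
    /// compressed size, so blocks are neither oversized nor fragmented.
    fn calc_rows_for_recluster(
        &self,
        total_bytes: usize,
        total_rows: usize,
        total_compressed: usize,
    ) -> usize {
        // How many blocks each constraint would produce on its own.
        let block_num_by_rows = total_rows.div_ceil(self.max_rows_per_block).max(1);
        let block_num_by_bytes = total_bytes.div_ceil(self.max_bytes_per_block).max(1);
        let block_num_by_compressed = total_compressed
            .div_ceil(self.min_compressed_per_block)
            .max(1);

        // When the row-based estimate already dominates, use it; otherwise
        // fall back to the compressed-size estimate so the output does not
        // degrade into many small blocks.
        let block_num = if block_num_by_rows >= block_num_by_compressed {
            block_num_by_rows.max(block_num_by_bytes)
        } else {
            block_num_by_compressed.max(block_num_by_bytes)
        };

        total_rows.div_ceil(block_num)
    }
}
```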

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@zhyass zhyass marked this pull request as draft March 22, 2025 15:02
@github-actions github-actions bot added the pr-chore label (this PR only has small changes that do not need to be recorded, like coding styles) on Mar 22, 2025
@zhyass zhyass added and removed the ci-cloud label (Build docker image for cloud test) on Mar 22, 2025

@databendlabs databendlabs deleted a comment from github-actions bot Mar 23, 2025
@zhyass zhyass added and removed the ci-cloud label (Build docker image for cloud test) on Mar 24, 2025

@zhyass zhyass added and removed the ci-cloud label (Build docker image for cloud test) on Mar 26, 2025

@zhyass zhyass added and removed the ci-cloud label (Build docker image for cloud test) on Mar 26, 2025
Contributor

Docker Image for PR

  • tag: pr-17639-b534304-1742985812

Note: this image tag is only available for internal use; please check the internal doc for more details.

@zhyass zhyass marked this pull request as ready for review March 26, 2025 12:45
@BohuTANG BohuTANG requested a review from Copilot March 27, 2025 00:29

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors block threshold calculations for recluster and compaction operations, ensuring that the computed rows per block better match the actual data size, row count, and compression metrics. Key changes include:

  • Refactoring of block thresholds with new functions (calc_rows_for_compact and calc_rows_for_recluster) that use revised min/max byte thresholds.
  • Updates to various modules and tests to adopt the new threshold parameters and default configuration constants.
  • Fixes in Hilbert recluster logic and improved propagation of updated thresholds in block compaction.

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/query/expression/tests/it/block_thresholds.rs: Added tests for verifying the new block threshold calculations.
  • src/query/storages/fuse/src/operations/mutation/mutator/block_compact_mutator.rs: Updated usage of SegmentCompactChecker and threshold parameters.
  • src/query/service/src/pipelines/processors/transforms/window/partition/data_processor_strategy.rs: Adjusted compact strategy to use new configuration settings.
  • src/query/service/tests/it/storages/fuse/statistics.rs: Updated test thresholds to match new defaults.
  • src/query/sql/src/executor/physical_plans/physical_recluster.rs: Injected rows_per_block in HilbertPartition for reclustering.
  • src/common/io/src/constants.rs: Modified default constants for block buffer and compression sizes.
  • src/query/service/src/pipelines/builders/builder_recluster.rs: Replaced calc_rows_per_block with calc_rows_for_recluster.
  • src/query/expression/src/utils/block_thresholds.rs: Refactored the threshold calculations and renamed functions for clarity.
  • src/query/catalog/src/table.rs: Leveraged default thresholds for table block settings.
  • src/query/service/src/pipelines/builders/builder_hilbert_partition.rs: Updated compact strategy construction with calculated max_bytes_per_block.
  • Other test and interpreter files: Adjusted to use the new BlockThresholds API and constants.

Comments suppressed due to low confidence (3)

src/query/expression/src/utils/block_thresholds.rs:149

  • The condition comparing the row-based block count with the compressed-based block count should be re-evaluated for scenarios with borderline data distributions. Consider adding targeted tests to verify that this logic yields the expected block row calculations in edge cases.
if block_num_by_rows >= block_num_by_compressed {

src/query/service/src/pipelines/builders/builder_hilbert_partition.rs:77

  • [nitpick] Consider extracting the calculation for max_bytes_per_block into a dedicated helper or constant to improve readability and maintain consistency across modules.
let max_bytes_per_block = std::cmp::min(4 * table.get_option(FUSE_OPT_KEY_BLOCK_IN_MEM_SIZE_THRESHOLD, DEFAULT_BLOCK_BUFFER_SIZE), 400 * 1024 * 1024);
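
A minimal sketch of the extraction suggested here, assuming a hypothetical helper name and constant; only the 4x multiplier and the 400 MiB cap come from the line above.

```rust
// Hypothetical helper; names are illustrative, not the actual Databend API.
const HILBERT_MAX_BLOCK_BYTES: usize = 400 * 1024 * 1024;

/// Byte budget per block for Hilbert partitioning: up to 4x the configured
/// in-memory block size, capped at 400 MiB.
fn hilbert_max_bytes_per_block(block_in_mem_threshold: usize) -> usize {
    (4 * block_in_mem_threshold).min(HILBERT_MAX_BLOCK_BYTES)
}

// The call site above would then read roughly:
// let max_bytes_per_block = hilbert_max_bytes_per_block(
//     table.get_option(FUSE_OPT_KEY_BLOCK_IN_MEM_SIZE_THRESHOLD, DEFAULT_BLOCK_BUFFER_SIZE),
// );
```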

src/query/expression/src/utils/block_thresholds.rs:110

  • [nitpick] Ensure that the naming and documentation clearly differentiate 'calc_rows_for_compact' from 'calc_rows_for_recluster', as their purposes are similar but apply in different scenarios.
pub fn calc_rows_for_compact(&self, total_bytes: usize, total_rows: usize) -> usize {
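
A hedged sketch of the doc-comment split this nitpick asks for; the recluster signature and wording are assumptions based on this PR's description, and the bodies are elided.

```rust
// Illustrative only; fields and bodies omitted.
pub struct BlockThresholds;

impl BlockThresholds {
    /// Compaction path: rows per block driven only by in-memory byte size
    /// and row count.
    pub fn calc_rows_for_compact(&self, total_bytes: usize, total_rows: usize) -> usize {
        todo!()
    }

    /// Recluster path: additionally considers the compressed (on-disk) size
    /// so Hilbert recluster does not emit fragmented small blocks.
    pub fn calc_rows_for_recluster(
        &self,
        total_bytes: usize,
        total_rows: usize,
        total_compressed: usize,
    ) -> usize {
        todo!()
    }
}
```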

@dantengsky
Member

LGTM

@youngsofun, could you please help review the changes in the following two places:

src/query/storages/stage/src/read/block_builder_state.rs
src/query/storages/stage/src/read/row_based/processors/block_builder.rs

thanks

@dantengsky dantengsky requested a review from youngsofun March 27, 2025 02:23
@BohuTANG BohuTANG merged commit cd95677 into databendlabs:main Mar 27, 2025
257 of 268 checks passed