Open
Labels
bug (Something isn't working), good first issue (Good for newcomers)
Description
Describe the bug
When reading Parquet files with truncated statistics, DataFusion reports the min/max as exact even though the metadata in the file indicates that the min/max values have been truncated.
To Reproduce
Create a Parquet file with truncated statistics and read the file; the statistics on the table are reported as Exact.
Expected behavior
The statistics should be Absent or Inexact.
Additional context
No response
alamb commented on May 16, 2025
👍
alamb commented on May 16, 2025
I think this is a good first issue as there is a test case and a clear explanation of the need.
CookiePieWw commented on May 22, 2025
take
CookiePieWw commented on May 29, 2025
Hi :) I've spent some time on this and found the problem in `get_col_stats`:
datafusion/datafusion/datasource-parquet/src/file_format.rs
Lines 1101 to 1107 in 2c2f225
Here we always use `Precision::Exact` to wrap the stats, but we actually need to respect the `is_max_value_exact` and `is_min_value_exact` flags in the column metadata. The max and min values are extracted at
datafusion/datafusion/datasource-parquet/src/file_format.rs
Lines 1112 to 1139 in 2c2f225
But I didn't find a method to access the `..exact` flags in `StatisticsConverter`, so my plan is to first add functions similar to `row_group_mins` to the converter to extract the flags, which requires a change to `arrow-rs` first, and then collect and pass the extracted boolean array of flags to `get_col_stats` to decide which one to use, `Precision::Exact` or `Precision::Inexact`. Please let me know if this direction makes sense.
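The decision described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in `get_col_stats`: the `Precision` enum is redefined locally as a stand-in for DataFusion's type so the snippet compiles on its own, `wrap_stat` is a hypothetical helper name, and treating a missing flag as inexact is an assumption made here for safety.

```rust
// Local stand-in for DataFusion's `Precision` enum (assumption: only the
// three variants discussed in this issue are needed for the sketch).
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical helper: wrap a min or max value according to its
/// exactness flag instead of always using `Precision::Exact`.
fn wrap_stat<T>(value: Option<T>, is_exact: Option<bool>) -> Precision<T> {
    match (value, is_exact) {
        // The file metadata says the value was not truncated.
        (Some(v), Some(true)) => Precision::Exact(v),
        // Truncated, or the flag is unknown: report the value as a bound only.
        // (Treating an unknown flag as inexact is an assumption of this sketch.)
        (Some(v), _) => Precision::Inexact(v),
        // No statistic was written at all.
        (None, _) => Precision::Absent,
    }
}
```

With this shape, a truncated max such as `wrap_stat(Some(5), Some(false))` comes back as `Precision::Inexact(5)`, which matches the expected behavior in the issue description.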
alamb commented on May 29, 2025
I think the first thing that is needed in arrow-rs is to expose the `is_max_value_exact` and `is_min_value_exact` fields in the corresponding Rust structs (`ValueStatistics`):
https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html
https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html
`row_group_is_[max/min]_value_exact` to StatisticsConverter apache/arrow-rs#7574
CookiePieWw commented on May 31, 2025
Thanks for your feedback! I found that `ValueStatistics` already has `max_is_exact` and `min_is_exact`, so it seems we can make use of them directly. I've drafted a PR at apache/arrow-rs#7574 :)
feat: add `row_group_is_[max/min]_value_exact` to StatisticsConverter (…
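To illustrate how per-row-group flags might feed a file-level statistic, here is a small standalone sketch. It does not use the parquet crate or reproduce the linked PR: `ColumnChunkStats`, `min_exact_flags`, and `file_min` are hypothetical names, and the rule "exact only if every row group reports an exact min" is a deliberately conservative assumption.

```rust
/// Hypothetical stand-in for one row group's column statistics; only the
/// fields needed here. `min_is_exact` echoes the flag named in the thread.
#[derive(Debug, Clone, Copy)]
struct ColumnChunkStats {
    min: i64,
    min_is_exact: bool,
}

/// One flag per row group; `None` where the row group has no statistics.
fn min_exact_flags(row_groups: &[Option<ColumnChunkStats>]) -> Vec<Option<bool>> {
    row_groups
        .iter()
        .map(|rg| rg.as_ref().map(|s| s.min_is_exact))
        .collect()
}

/// File-level minimum plus whether it may be reported as exact.
/// Assumption: exact only if every row group has statistics and every
/// flag is `true`; otherwise the minimum is treated as a bound.
fn file_min(row_groups: &[Option<ColumnChunkStats>]) -> Option<(i64, bool)> {
    // Smallest min across the row groups that carry statistics.
    let min = row_groups.iter().flatten().map(|s| s.min).min()?;
    let all_exact = min_exact_flags(row_groups)
        .iter()
        .all(|flag| *flag == Some(true));
    Some((min, all_exact))
}
```

A caller could then map the returned boolean to `Precision::Exact` or `Precision::Inexact`. A tighter rule is possible (for example, looking only at the row group that supplies the minimum), but the all-or-nothing check keeps the sketch easy to reason about.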
nssalian commented on Jul 21, 2025
take