
Treat truncated parquet stats as inexact #15976

@robert3005

Description


Describe the bug

When reading Parquet files with truncated statistics, DataFusion reports the min/max as exact even though the metadata in the file indicates that the min/max values have been truncated.

To Reproduce

Create a Parquet file with truncated statistics and read it. The statistics reported on the table are Exact.

Expected behavior

The statistic should be Absent or Inexact
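To illustrate why a truncated bound cannot be treated as exact, here is a minimal sketch of how a writer might shorten a max statistic. It assumes the common approach of cutting the value to a byte limit and incrementing the last byte so the result is still an upper bound. The function name `truncate_max` is hypothetical and is not the parquet crate's API.

```rust
/// Hypothetical helper: shorten a max statistic to at most `len` bytes while
/// keeping it a valid upper bound. Returns the bytes and whether they are
/// still the exact maximum, or `None` if no shorter upper bound exists.
fn truncate_max(value: &[u8], len: usize) -> Option<(Vec<u8>, bool)> {
    if value.len() <= len {
        // Nothing was cut off, so the statistic is still the true maximum.
        return Some((value.to_vec(), true));
    }
    let mut prefix = value[..len].to_vec();
    // Drop trailing 0xFF bytes (they cannot be incremented), then bump the
    // last remaining byte so the prefix sorts above the original value.
    while let Some(&last) = prefix.last() {
        if last == 0xFF {
            prefix.pop();
        } else {
            *prefix.last_mut().unwrap() += 1;
            return Some((prefix, false));
        }
    }
    // Every byte in the prefix was 0xFF: there is no shorter upper bound.
    None
}
```

Calling `truncate_max(b"datafusion", 4)` yields the bytes of `"datb"` with the exactness flag set to `false`: the value is a usable upper bound for pruning, but no row actually contains it, which is why reporting it as Exact is wrong.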

Additional context

No response

Activity

alamb (Contributor) commented on May 16, 2025

👍

alamb (Contributor) commented on May 16, 2025

I think this is a good first issue as there is a test case and clear explanation of the need

CookiePieWw commented on May 22, 2025

take

CookiePieWw commented on May 29, 2025

Hi :) I've spent some time on this and found the problem in get_col_stats:

    ColumnStatistics {
        null_count: null_counts[i],
        max_value: max_value.map(Precision::Exact).unwrap_or(Precision::Absent),
        min_value: min_value.map(Precision::Exact).unwrap_or(Precision::Absent),
        sum_value: Precision::Absent,
        distinct_count: Precision::Absent,
    }

Here we always use Precision::Exact to wrap the stats, but actually we need to respect the is_max_value_exact and is_min_value_exact flags in the column metadata.
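A minimal sketch of what respecting the flags could look like. The `Precision` enum below is a simplified stand-in for DataFusion's type, reduced to the three variants discussed in this issue, and `to_precision` is a hypothetical helper rather than an existing function.

```rust
/// Simplified stand-in for DataFusion's `Precision` enum (assumption: only
/// the three variants mentioned in this issue are modeled).
#[derive(Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical helper: wrap an optional statistic according to its
/// exactness flag instead of always using `Precision::Exact`.
fn to_precision<T>(value: Option<T>, is_exact: bool) -> Precision<T> {
    match (value, is_exact) {
        (Some(v), true) => Precision::Exact(v),
        (Some(v), false) => Precision::Inexact(v),
        // Without a value there is nothing to report, whatever the flag says.
        (None, _) => Precision::Absent,
    }
}
```

With a helper like this, the `max_value` and `min_value` fields would pass the matching flag for each bound instead of mapping unconditionally to `Precision::Exact`.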

The max and min values are extracted in summarize_min_max_null_counts:

    fn summarize_min_max_null_counts(
        min_accs: &mut [Option<MinAccumulator>],
        max_accs: &mut [Option<MaxAccumulator>],
        null_counts_array: &mut [Precision<usize>],
        arrow_schema_index: usize,
        num_rows: usize,
        stats_converter: &StatisticsConverter,
        row_groups_metadata: &[RowGroupMetaData],
    ) -> Result<()> {
        let max_values = stats_converter.row_group_maxes(row_groups_metadata)?;
        let min_values = stats_converter.row_group_mins(row_groups_metadata)?;
        let null_counts = stats_converter.row_group_null_counts(row_groups_metadata)?;
        if let Some(max_acc) = &mut max_accs[arrow_schema_index] {
            max_acc.update_batch(&[max_values])?;
        }
        if let Some(min_acc) = &mut min_accs[arrow_schema_index] {
            min_acc.update_batch(&[min_values])?;
        }
        null_counts_array[arrow_schema_index] = Precision::Exact(match sum(&null_counts) {
            Some(null_count) => null_count as usize,
            None => num_rows,
        });
        Ok(())
    }

But I didn't find a method to access the exactness flags in StatisticsConverter. My plan is to first add functions similar to row_group_mins to the converter to extract the flags, which requires a change to arrow-rs. I would then collect the extracted boolean array of flags and pass it to get_col_stats to decide whether to use Precision::Exact or Precision::Inexact.

Please let me know if this direction makes sense.
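The plan above could be sketched roughly as follows: fold the per-row-group minimums together with their exactness flags into one file-level value and one flag. `RowGroupMin` and `summarize_min` are hypothetical names used for illustration only. In the real code the values would come from the Parquet metadata, and the conservative "all flags must be exact" rule is an assumption, not a decision made in this thread.

```rust
/// Hypothetical per-row-group summary of a string column's min statistic.
struct RowGroupMin<'a> {
    min: Option<&'a str>,
    is_exact: bool,
}

/// Fold row-group minimums into a file-level minimum plus one exactness flag.
/// Conservative rule (assumption): the result is exact only if every
/// contributing row-group value is exact.
fn summarize_min<'a>(groups: &[RowGroupMin<'a>]) -> (Option<&'a str>, bool) {
    let mut overall: Option<&'a str> = None;
    let mut all_exact = true;
    for g in groups {
        // Row groups without a min are skipped in this sketch; a real
        // implementation must decide how to treat them.
        if let Some(v) = g.min {
            overall = Some(match overall {
                Some(cur) if cur <= v => cur,
                _ => v,
            });
            all_exact &= g.is_exact;
        }
    }
    (overall, overall.is_some() && all_exact)
}
```

For two row groups with minimums `"b"` (exact) and `"a"` (truncated), this returns `(Some("a"), false)`, which would map to Precision::Inexact. The same folding applies to the max side with the comparison reversed.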

alamb (Contributor) commented on May 29, 2025

> But I didn't find a method to access the ..exact flags in StatisticsConverter, so my plan is to first add functions similar to row_group_mins to the converter to extract the flags, which requires a change to arrow-rs first, and then collect and pass the extracted boolean array of flags to get_col_stats to decide which one to use, Precision::Exact and Precision::InExact.

I think the first thing needed in arrow-rs is to expose the is_max_value_exact and is_min_value_exact fields in the corresponding Rust struct (ValueStatistics):

https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html
https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html

CookiePieWw commented on May 31, 2025

Thanks for your feedback! I found that ValueStatistics already has max_is_exact and min_is_exact, so it seems we can use them directly. I've drafted a PR at apache/arrow-rs#7574 :)

nssalian commented on Jul 21, 2025

take


Metadata

Labels

bug, good first issue
