Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412)

[PARQUET-2352](apache/parquet-format#216) added fields to the `Statistics` struct that indicate whether the min and max values are exact or have been truncated; previously this was ambiguous. One reason to want this information is that it lets a reader skip decoding pages (or column chunks) that contain a single value: if the min and max are equal, both are known to be exact, and there are no nulls, then the only possible value in the page is that value. This PR adds the new fields, which will always be `true` in cuDF, since cuDF does not truncate min and max values in the statistics (although it does support truncation in the page indexes).
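
The single-value check described above can be sketched as follows. This is an illustration only, not code from this PR: `StatsSketch` and `holds_single_value` are hypothetical names, and the struct is a simplified stand-in for the real `Statistics` struct that uses `std::optional` rather than `thrust::optional`. Any missing field is treated as unknown, so the check conservatively returns false.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified mirror of the Statistics fields relevant to the check.
struct StatsSketch {
  std::optional<int64_t> null_count;
  std::optional<std::vector<uint8_t>> max_value;
  std::optional<std::vector<uint8_t>> min_value;
  std::optional<bool> is_max_value_exact;
  std::optional<bool> is_min_value_exact;
};

// Returns true only when the statistics prove that every row holds the same
// non-null value. Absent fields count as "unknown" and yield false.
inline bool holds_single_value(StatsSketch const& s)
{
  if (!s.min_value.has_value() || !s.max_value.has_value()) { return false; }
  if (!s.null_count.has_value() || s.null_count.value() != 0) { return false; }
  if (!s.is_min_value_exact.value_or(false)) { return false; }
  if (!s.is_max_value_exact.value_or(false)) { return false; }
  // Both bounds are exact and there are no nulls, so equal bounds mean one value.
  return s.min_value.value() == s.max_value.value();
}
```

A reader could call such a check before decoding a page and, when it returns true, materialize the page from `min_value` alone. Files written before parquet-format 2.10 will not carry the exactness flags, so the check would simply return false for them.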

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #15412
etseidl committed Apr 26, 2024
1 parent d91a4ad commit 064dd7b
Showing 5 changed files with 19 additions and 1 deletion.
5 changes: 4 additions & 1 deletion cpp/src/io/parquet/compact_protocol_reader.cpp
@@ -763,13 +763,16 @@ void CompactProtocolReader::read(Statistics* s)
{
using optional_binary = parquet_field_optional<std::vector<uint8_t>, parquet_field_binary>;
using optional_int64 = parquet_field_optional<int64_t, parquet_field_int64>;
+ using optional_bool = parquet_field_optional<bool, parquet_field_bool>;

auto op = std::make_tuple(optional_binary(1, s->max),
optional_binary(2, s->min),
optional_int64(3, s->null_count),
optional_int64(4, s->distinct_count),
optional_binary(5, s->max_value),
- optional_binary(6, s->min_value));
+ optional_binary(6, s->min_value),
+ optional_bool(7, s->is_max_value_exact),
+ optional_bool(8, s->is_min_value_exact));
function_builder(this, op);
}
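
For context on what reading fields 7 and 8 involves: in the Thrift compact protocol, a boolean struct field carries its value in the type nibble of the field header (type 1 for true, type 2 for false), so there is no separate value byte. The sketch below decodes one short-form header byte. It is an assumption-laden illustration, not cuDF's `parquet_field_bool`; `BoolField` and `read_bool_field` are hypothetical names, and long-form headers are not handled.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical result type: the decoded field id and its boolean value.
struct BoolField {
  int field_id;
  bool value;
};

// Decodes a short-form compact-protocol field header byte ("dddd tttt"):
// the high nibble is the field id delta, the low nibble is the type.
// Returns nullopt for a stop byte, a long-form header, or a non-boolean type.
inline std::optional<BoolField> read_bool_field(uint8_t header, int prev_id)
{
  int const delta = header >> 4;
  int const type  = header & 0x0f;
  if (delta == 0) { return std::nullopt; }               // stop byte or long form
  if (type != 1 && type != 2) { return std::nullopt; }   // not a boolean field
  return BoolField{prev_id + delta, type == 1};
}
```

When a file omits fields 7 and 8, no such header is ever seen and the corresponding optional members stay unset, which is why the struct members are declared optional.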

2 changes: 2 additions & 0 deletions cpp/src/io/parquet/compact_protocol_writer.cpp
@@ -202,6 +202,8 @@ size_t CompactProtocolWriter::write(Statistics const& s)
if (s.distinct_count.has_value()) { c.field_int(4, s.distinct_count.value()); }
if (s.max_value.has_value()) { c.field_binary(5, s.max_value.value()); }
if (s.min_value.has_value()) { c.field_binary(6, s.min_value.value()); }
+ if (s.is_max_value_exact.has_value()) { c.field_bool(7, s.is_max_value_exact.value()); }
+ if (s.is_min_value_exact.has_value()) { c.field_bool(8, s.is_min_value_exact.value()); }
return c.value();
}
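
As a rough picture of the bytes these two calls produce, the sketch below encodes a boolean field header in the Thrift compact protocol's short form, where the field id delta occupies the high nibble and the boolean value is folded into the type nibble. It is not cuDF's field writer: `put_bool_field` and the constants are hypothetical names, and the long form (delta outside 1 to 15) is deliberately left out.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Compact-protocol type ids for boolean struct fields.
constexpr uint8_t kBoolTrue  = 1;
constexpr uint8_t kBoolFalse = 2;

// Appends a short-form boolean field header to `out` and returns `field_id`,
// which the caller uses as the previous field id for the next field.
inline int put_bool_field(std::vector<uint8_t>& out, int prev_id, int field_id, bool value)
{
  int const delta = field_id - prev_id;
  if (delta < 1 || delta > 15) {
    throw std::invalid_argument("long-form field headers are not handled in this sketch");
  }
  uint8_t const type = value ? kBoolTrue : kBoolFalse;
  out.push_back(static_cast<uint8_t>((delta << 4) | type));
  return field_id;
}
```

Under these assumptions, writing field 7 as true directly after field 6, and then field 8 as true, adds one byte each (`0x11`, `0x11`), so the new flags cost two bytes per `Statistics` struct.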

3 changes: 3 additions & 0 deletions cpp/src/io/parquet/page_enc.cu
@@ -2944,6 +2944,9 @@ __device__ uint8_t* EncodeStatistics(uint8_t* start,
auto const [min_ptr, min_size] =
get_extremum(&s->min_value, dtype, scratch, true, NO_TRUNC_STATS);
encoder.field_binary(6, min_ptr, min_size);
+ // cudf min/max statistics are always exact (i.e. not truncated)
+ encoder.field_bool(7, true);
+ encoder.field_bool(8, true);
}
encoder.end(&end);
return end;
4 changes: 4 additions & 0 deletions cpp/src/io/parquet/parquet.hpp
@@ -259,6 +259,10 @@ struct Statistics {
thrust::optional<std::vector<uint8_t>> max_value;
// min value for column determined by ColumnOrder
thrust::optional<std::vector<uint8_t>> min_value;
+ // If true, max_value is the actual maximum value for a column
+ thrust::optional<bool> is_max_value_exact;
+ // If true, min_value is the actual minimum value for a column
+ thrust::optional<bool> is_min_value_exact;
};

/**
6 changes: 6 additions & 0 deletions cpp/tests/io/parquet_writer_test.cpp
@@ -903,6 +903,12 @@ TEST_F(ParquetWriterTest, CheckColumnIndexTruncation)
ASSERT_TRUE(stats.min_value.has_value());
ASSERT_TRUE(stats.max_value.has_value());

+ // check that min and max for the column chunk are exact (i.e. not truncated)
+ ASSERT_TRUE(stats.is_max_value_exact.has_value());
+ EXPECT_TRUE(stats.is_max_value_exact.value());
+ ASSERT_TRUE(stats.is_min_value_exact.has_value());
+ EXPECT_TRUE(stats.is_min_value_exact.value());

// check trunc(page.min) <= stats.min && trunc(page.max) >= stats.max
auto const ptype = fmd.schema[c + 1].type;
auto const ctype = fmd.schema[c + 1].converted_type;