-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hasNull statistic reading ability to ORC #11747
Conversation
auto const stats = cudf_io::read_parsed_orc_statistics( | ||
cudf_io::source_info{reinterpret_cast<char const*>(nulls_orc.data()), nulls_orc.size()}); | ||
|
||
EXPECT_EQ(stats.file_stats[1].has_null, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: The stats values for columns start from index 1. See similar test OrcStatisticsTest.Basic
:
cudf/cpp/tests/io/orc_test.cpp
Lines 1015 to 1024 in 398aebe
auto validate_statistics = [&](std::vector<cudf_io::column_statistics> const& stats) { | |
auto& s0 = stats[0]; | |
EXPECT_EQ(*s0.number_of_values, 9ul); | |
auto& s1 = stats[1]; | |
EXPECT_EQ(*s1.number_of_values, 4ul); | |
auto& ts1 = std::get<cudf_io::integer_statistics>(s1.type_specific_stats); | |
EXPECT_EQ(*ts1.minimum, 1); | |
EXPECT_EQ(*ts1.maximum, 7); | |
EXPECT_EQ(*ts1.sum, 16); |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-22.10 #11747 +/- ##
===============================================
Coverage ? 87.52%
===============================================
Files ? 133
Lines ? 21774
Branches ? 0
===============================================
Hits ? 19057
Misses ? 2717
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
// ---- | ||
// a: [[1,2,null]] | ||
// b: [[3,4,5]] | ||
auto nulls_orc = std::array<uint8_t, 308>{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is going to be replaced by a round-trip write+read in the future then we may need to add comment to clarify that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the patch!
Looks good, just one suggestion.
cudf_io::source_info{reinterpret_cast<char const*>(nulls_orc.data()), nulls_orc.size()}); | ||
|
||
EXPECT_EQ(stats.file_stats[1].has_null, true); | ||
EXPECT_EQ(stats.file_stats[2].has_null, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also validate stripe stats here
EXPECT_EQ(stats.stripes_stats[0][1].has_null, true);
EXPECT_EQ(stats.stripes_stats[0][2].has_null, false);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give a little clarification on why the column stats start from index 1?
cudf/cpp/tests/io/orc_test.cpp
Lines 1015 to 1024 in 398aebe
auto validate_statistics = [&](std::vector<cudf_io::column_statistics> const& stats) { | |
auto& s0 = stats[0]; | |
EXPECT_EQ(*s0.number_of_values, 9ul); | |
auto& s1 = stats[1]; | |
EXPECT_EQ(*s1.number_of_values, 4ul); | |
auto& ts1 = std::get<cudf_io::integer_statistics>(s1.type_specific_stats); | |
EXPECT_EQ(*ts1.minimum, 1); | |
EXPECT_EQ(*ts1.maximum, 7); | |
EXPECT_EQ(*ts1.sum, 16); |
Is it because the first column is actually a Struct that contains these child columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ORC files have a root struct column.
@gpucibot merge |
Description
Adds the ability for ORC statistics reader to read the value
ColumnStatistics::hasNull
.Contributes to #7087. Does not close it because the issue also requires the ability to write the field in the orc writer.
Checklist