Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hasNull statistic reading ability to ORC #11747

Merged
merged 3 commits into from
Sep 26, 2022

Conversation

devavret
Copy link
Contributor

Description

Adds the ability for ORC statistics reader to read the value ColumnStatistics::hasNull.

Contributes to #7087. Does not close it because the issue also requires the ability to write the field in the orc writer.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@devavret devavret requested a review from a team as a code owner September 22, 2022 20:59
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 22, 2022
@devavret devavret added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Sep 22, 2022
auto const stats = cudf_io::read_parsed_orc_statistics(
cudf_io::source_info{reinterpret_cast<char const*>(nulls_orc.data()), nulls_orc.size()});

EXPECT_EQ(stats.file_stats[1].has_null, true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: The stats values for columns start from index 1. See similar test OrcStatisticsTest.Basic:

cudf/cpp/tests/io/orc_test.cpp

Lines 1015 to 1024 in 398aebe

auto validate_statistics = [&](std::vector<cudf_io::column_statistics> const& stats) {
auto& s0 = stats[0];
EXPECT_EQ(*s0.number_of_values, 9ul);
auto& s1 = stats[1];
EXPECT_EQ(*s1.number_of_values, 4ul);
auto& ts1 = std::get<cudf_io::integer_statistics>(s1.type_specific_stats);
EXPECT_EQ(*ts1.minimum, 1);
EXPECT_EQ(*ts1.maximum, 7);
EXPECT_EQ(*ts1.sum, 16);

@ttnghia ttnghia added this to PR-WIP in v22.10 Release via automation Sep 22, 2022
@codecov
Copy link

codecov bot commented Sep 22, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@5c91739). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11747   +/-   ##
===============================================
  Coverage                ?   87.52%           
===============================================
  Files                   ?      133           
  Lines                   ?    21774           
  Branches                ?        0           
===============================================
  Hits                    ?    19057           
  Misses                  ?     2717           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@devavret devavret requested a review from vuule September 23, 2022 09:29
// ----
// a: [[1,2,null]]
// b: [[3,4,5]]
auto nulls_orc = std::array<uint8_t, 308>{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to be replaced by a round-trip write+read in the future then we may need to add comment to clarify that here.

v22.10 Release automation moved this from PR-WIP to PR-Reviewer approved Sep 23, 2022
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch!
Looks good, just one suggestion.

cudf_io::source_info{reinterpret_cast<char const*>(nulls_orc.data()), nulls_orc.size()});

EXPECT_EQ(stats.file_stats[1].has_null, true);
EXPECT_EQ(stats.file_stats[2].has_null, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also validate stripe stats here
EXPECT_EQ(stats.stripes_stats[0][1].has_null, true);
EXPECT_EQ(stats.stripes_stats[0][2].has_null, false);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give a little clarification on why the column stats start from index 1?

cudf/cpp/tests/io/orc_test.cpp

Lines 1015 to 1024 in 398aebe

auto validate_statistics = [&](std::vector<cudf_io::column_statistics> const& stats) {
auto& s0 = stats[0];
EXPECT_EQ(*s0.number_of_values, 9ul);
auto& s1 = stats[1];
EXPECT_EQ(*s1.number_of_values, 4ul);
auto& ts1 = std::get<cudf_io::integer_statistics>(s1.type_specific_stats);
EXPECT_EQ(*ts1.minimum, 1);
EXPECT_EQ(*ts1.maximum, 7);
EXPECT_EQ(*ts1.sum, 16);

Is it because the first column is actually a Struct that contains these child columns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ORC files have a root struct column.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 23, 2022
@devavret
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 6131bd6 into rapidsai:branch-22.10 Sep 26, 2022
v22.10 Release automation moved this from PR-Reviewer approved to Done Sep 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants