Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return empty dataframe when reading a Parquet file using empty columns option #11018

Merged
merged 14 commits into from
Jun 17, 2022

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Jun 1, 2022

Fixes #8668
Store the columns option as optional:

  • nullopt when columns are not passed by caller - read all columns.
  • Empty vector when caller explicitly passes an empty list/vector - return empty dataframe.
  • Vector of column names - read columns with given names.

@vuule vuule added bug Something isn't working cuIO cuIO issue breaking Breaking change labels Jun 1, 2022
@vuule vuule self-assigned this Jun 1, 2022
@vuule vuule added this to PR-WIP in v22.08 Release via automation Jun 1, 2022
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jun 1, 2022
@codecov
Copy link

codecov bot commented Jun 1, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@f5faa99). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11018   +/-   ##
===============================================
  Coverage                ?   86.34%           
===============================================
  Files                   ?      144           
  Lines                   ?    22741           
  Branches                ?        0           
===============================================
  Hits                    ?    19635           
  Misses                  ?     3106           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5faa99...2af0afc. Read the comment docs.

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried out your changes locally, libcudf behavior seems to be expected now.
These are the corresponding additional cython/python changes needed to achieve feature parity with pandas.

python/cudf/cudf/_lib/parquet.pyx Show resolved Hide resolved
python/cudf/cudf/_lib/parquet.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/parquet.pyx Show resolved Hide resolved
python/cudf/cudf/_lib/parquet.pyx Show resolved Hide resolved
v22.08 Release automation moved this from PR-WIP to PR-Needs review Jun 1, 2022
@vuule vuule requested a review from galipremsagar June 15, 2022 22:13
@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. labels Jun 15, 2022
@vuule vuule changed the base branch from branch-22.08 to branch-22.06 June 15, 2022 22:14
@vuule vuule changed the base branch from branch-22.06 to branch-22.08 June 15, 2022 22:14
@github-actions github-actions bot removed CMake CMake build issue gpuCI labels Jun 15, 2022
@vuule vuule requested review from a team as code owners June 16, 2022 14:19
@vuule vuule requested review from PointKernel and jlowe June 16, 2022 19:28
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, but this looks good to me. A clean solution to the problem.

*
* @return Names of column to be read
* @return Names of column to be read; `nullopt` if the option is not set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be std::nullopt to be clear?

/**
* Read parquet formatted data.
* @param opts various parquet parsing options.
* @param buffer raw parquet formatted bytes.
* @return the data parsed as a table on the GPU.
*/
public static Table readParquet(ParquetOptions opts, byte[] buffer) {
if (opts.getIncludeColumnNames().length == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an undesirable change and is reflected in the corresponding update to the tests. If the caller must specify all columns to read from the file, how does the caller read a Parquet file when they don't know the columns in the file? (e.g.: doing dynamic schema discovery). This used to be possible with the Java API but now seems like it isn't.

Comment on lines 592 to 617
.includeColumn("loan_id")
.includeColumn("orig_channel")
.includeColumn("orig_interest_rate")
.includeColumn("orig_upb")
.includeColumn("orig_loan_term")
.includeColumn("orig_date")
.includeColumn("first_pay_date")
.includeColumn("orig_ltv")
.includeColumn("orig_cltv")
.includeColumn("num_borrowers")
.includeColumn("dti")
.includeColumn("borrower_credit_score")
.includeColumn("first_home_buyer")
.includeColumn("loan_purpose")
.includeColumn("property_type")
.includeColumn("num_units")
.includeColumn("occupancy_status")
.includeColumn("property_state")
.includeColumn("zip")
.includeColumn("mortgage_insurance_percent")
.includeColumn("product_type")
.includeColumn("coborrow_credit_score")
.includeColumn("mortgage_insurance_type")
.includeColumn("relocation_mortgage_indicator")
.includeColumn("quarter")
.includeColumn("seller_id")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be necessary. There needs to be a way to read all columns of a Parquet file without prior knowledge of all column names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to achieve this using includeColumn, but having includeColumns that takes an array of names allows this behavior - skip the call to includeColumns if you want to read all columns. If you want to read none, pass an empty array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this change should not have a significant impact on the API. If no columns are specified (i.e.: no includeColumn or includeColumns calls) then it should read all columns as it did before. That means all the existing tests should run without modification. However the code explicitly throws if no columns were ever specified, which required many existing tests to be updated. That's the thing we need to fix, otherwise there's no way to load a Parquet file without knowing the columns up-front.

having includeColumns that takes an array of names allows this behavior - skip the call to includeColumns if you want to read all columns. If you want to read none, pass an empty array.

Or don't call includeColumn or includeColumns at all, which is the way it used to work. If we want to add special behavior for a case where the caller explicitly does not want to load any columns (e.g.: opts.includeColumns(new String[0]) then we can do so. However I'm not sure we can actually load an empty Table. Table wraps a cudf::table_view and the latter does not allow construction with zero columns. It seems like the Java change, if any, should be checking for no columns specified when includeColumns is called and throwing there rather than checking for no columns when trying to load the Parquet file, as the latter means we want to load all columns rather than no columns. If that's done then no existing tests would be modified, all we would do is add a new test that verifies the API throws in that corner case.

@andygrove
Copy link
Contributor

I misunderstood what was needed on the Java side here. I have re-implemented this in my branch at https://github.com/andygrove/cudf/tree/bug-pq-reader-empty-col-names-java2 so that the JNI code will not add columns to the parquet_reader_options::builder if there are no columns.

@vuule could you merge those changes into this PR?

@vuule vuule requested a review from jlowe June 17, 2022 15:09
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the Java tests pass, lgtm.

v22.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jun 17, 2022
@vuule
Copy link
Contributor Author

vuule commented Jun 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 379faf9 into rapidsai:branch-22.08 Jun 17, 2022
v22.08 Release automation moved this from PR-Reviewer approved to Done Jun 17, 2022
rapids-bot bot pushed a commit that referenced this pull request Jun 20, 2022
Some recently merged PRs (#11018 + #11036) do not include enough header which may cause compile error in some systems (in particular, CUDA 11.7 + gcc-11.2). This PR adds the missing header (`<optional>`) to fix the compile issue.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11126
rapids-bot bot pushed a commit that referenced this pull request Aug 4, 2022
… option (#11446)

Changes are mostly equivalent to Parquet changes in #11018.

Store the `columns` option as `optional`:

- `nullopt` when columns are not passed by caller - read all columns.
- Empty vector when caller explicitly passes an empty list/vector - return empty dataframe.
- Vector of column names - read columns with given names.

Also includes a small cleanup of the code equivalent in the Parquet reader.

Fixes #11021

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11446
@vuule vuule deleted the bug-pq-reader-empty-col-names branch August 10, 2023 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change bug Something isn't working cuIO cuIO issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[BUG] Empty columns is not returning empty dataframe in read_parquet
7 participants