Return empty dataframe when reading a Parquet file using empty `columns` option #11018

vuule · 2022-06-01T18:35:53Z

Fixes #8668
Store the `columns` option as `optional`:

- `nullopt` when columns are not passed by caller - read all columns.
- Empty vector when caller explicitly passes an empty list/vector - return empty dataframe.
- Vector of column names - read columns with given names.
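
The three cases above can be sketched in Python, where `None` plays the role of `nullopt`. This is only an illustration of the intended semantics, not the libcudf implementation: `select_columns` and `file_columns` are hypothetical names, and skipping names that are absent from the file is an arbitrary choice made for this sketch.

```python
from typing import List, Optional, Sequence


def select_columns(
    file_columns: Sequence[str],
    columns: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical helper: resolve which columns a reader would load."""
    if columns is None:
        # Option not set by the caller: read every column in the file.
        return list(file_columns)
    # Option set, possibly to an empty list: read only what was requested.
    # An empty request gives an empty selection (i.e. an empty dataframe).
    # Illustrative choice only: names absent from the file are skipped here.
    return [name for name in columns if name in file_columns]


file_columns = ["loan_id", "orig_channel", "zip"]
print(select_columns(file_columns))           # option not set
print(select_columns(file_columns, []))       # explicit empty list
print(select_columns(file_columns, ["zip"]))  # explicit names
```

The point of the optional wrapper is that "not set" and "set to empty" stay distinguishable; with a plain list, both would look like an empty list.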


codecov · 2022-06-01T20:33:34Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@f5faa99).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11018   +/-   ##
===============================================
  Coverage                ?   86.34%           
===============================================
  Files                   ?      144           
  Lines                   ?    22741           
  Branches                ?        0           
===============================================
  Hits                    ?    19635           
  Misses                  ?     3106           
  Partials                ?        0

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f5faa99...2af0afc.

galipremsagar

I've tried out your changes locally, and the libcudf behavior now seems to be as expected.
These are the corresponding additional Cython/Python changes needed to achieve feature parity with pandas.

python/cudf/cudf/_lib/parquet.pyx

hyperbolic2346

Minor nit, but this looks good to me. A clean solution to the problem.

hyperbolic2346 · 2022-06-16T20:32:11Z

cpp/include/cudf/io/parquet.hpp

   *
-   * @return Names of column to be read
+   * @return Names of column to be read; `nullopt` if the option is not set


Should this be `std::nullopt` to be clear?

jlowe · 2022-06-16T20:56:58Z

java/src/main/java/ai/rapids/cudf/Table.java

  /**
   * Read parquet formatted data.
   * @param opts various parquet parsing options.
   * @param buffer raw parquet formatted bytes.
   * @return the data parsed as a table on the GPU.
   */
  public static Table readParquet(ParquetOptions opts, byte[] buffer) {
+    if (opts.getIncludeColumnNames().length == 0) {


This looks like an undesirable change and is reflected in the corresponding update to the tests. If the caller must specify all columns to read from the file, how does the caller read a Parquet file when they don't know the columns in the file? (e.g.: doing dynamic schema discovery). This used to be possible with the Java API but now seems like it isn't.

jlowe · 2022-06-16T20:58:24Z

java/src/test/java/ai/rapids/cudf/TableTest.java

+            .includeColumn("loan_id")
+            .includeColumn("orig_channel")
+            .includeColumn("orig_interest_rate")
+            .includeColumn("orig_upb")
+            .includeColumn("orig_loan_term")
+            .includeColumn("orig_date")
+            .includeColumn("first_pay_date")
+            .includeColumn("orig_ltv")
+            .includeColumn("orig_cltv")
+            .includeColumn("num_borrowers")
+            .includeColumn("dti")
+            .includeColumn("borrower_credit_score")
+            .includeColumn("first_home_buyer")
+            .includeColumn("loan_purpose")
+            .includeColumn("property_type")
+            .includeColumn("num_units")
+            .includeColumn("occupancy_status")
+            .includeColumn("property_state")
+            .includeColumn("zip")
+            .includeColumn("mortgage_insurance_percent")
+            .includeColumn("product_type")
+            .includeColumn("coborrow_credit_score")
+            .includeColumn("mortgage_insurance_type")
+            .includeColumn("relocation_mortgage_indicator")
+            .includeColumn("quarter")
+            .includeColumn("seller_id")


This should not be necessary. There needs to be a way to read all columns of a Parquet file without prior knowledge of all column names.

I'm not sure how to achieve this using `includeColumn`, but having an `includeColumns` that takes an array of names allows this behavior: skip the call to `includeColumns` if you want to read all columns. If you want to read none, pass an empty array.

Ideally this change should not have a significant impact on the API. If no columns are specified (i.e.: no includeColumn or includeColumns calls) then it should read all columns as it did before. That means all the existing tests should run without modification. However the code explicitly throws if no columns were ever specified, which required many existing tests to be updated. That's the thing we need to fix, otherwise there's no way to load a Parquet file without knowing the columns up-front.

> having includeColumns that takes an array of names allows this behavior - skip the call to includeColumns if you want to read all columns. If you want to read none, pass an empty array.

Or don't call `includeColumn` or `includeColumns` at all, which is the way it used to work. If we want to add special behavior for the case where the caller explicitly does not want to load any columns (e.g. `opts.includeColumns(new String[0])`), then we can do so. However, I'm not sure we can actually load an empty `Table`: `Table` wraps a `cudf::table_view`, and the latter does not allow construction with zero columns.

It seems like the Java change, if any, should check for no columns specified when `includeColumns` is called and throw there, rather than checking for no columns when trying to load the Parquet file, since the latter case means we want to load all columns rather than none. If that's done, no existing tests would need to be modified; all we would do is add a new test that verifies the API throws in that corner case.

andygrove · 2022-06-17T13:24:24Z

I misunderstood what was needed on the Java side here. I have re-implemented this in my branch at https://github.com/andygrove/cudf/tree/bug-pq-reader-empty-col-names-java2 so that the JNI code will not add columns to the parquet_reader_options::builder if there are no columns.

@vuule could you merge those changes into this PR?
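
A rough sketch of the approach described above, written in Python for illustration rather than as the actual JNI code: the binding layer only sets the columns option when the caller named at least one column, so "no columns specified" continues to mean "read all columns". `ReaderOptionsBuilder` and `options_from_binding` are hypothetical stand-ins, not real cuDF APIs.

```python
from typing import List, Optional, Sequence


class ReaderOptionsBuilder:
    """Hypothetical stand-in for a reader options builder."""

    def __init__(self) -> None:
        # None means "option not set", so the reader loads every column.
        self._columns: Optional[List[str]] = None

    def columns(self, names: Sequence[str]) -> "ReaderOptionsBuilder":
        self._columns = list(names)
        return self

    def build(self) -> Optional[List[str]]:
        return self._columns


def options_from_binding(include_column_names: Sequence[str]) -> Optional[List[str]]:
    """Hypothetical binding-side logic: skip the option when nothing was named."""
    builder = ReaderOptionsBuilder()
    # Only set the option when the caller named at least one column.
    if len(include_column_names) > 0:
        builder.columns(include_column_names)
    return builder.build()


print(options_from_binding([]))            # option left unset
print(options_from_binding(["loan_id"]))   # option set to the given names
```

Under this sketch, a binding caller cannot request "zero columns"; an empty name list is treated the same as never specifying columns.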

jlowe

Assuming the Java tests pass, lgtm.

java/src/main/java/ai/rapids/cudf/ColumnFilterOptions.java

vuule · 2022-06-17T17:19:06Z

@gpucibot merge

Some recently merged PRs (#11018 + #11036) do not include enough headers, which may cause compile errors on some systems (in particular, CUDA 11.7 + gcc-11.2). This PR adds the missing header (`<optional>`) to fix the compile issue.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Yunsong Wang (https://github.com/PointKernel)

URL: #11126

… option (#11446)

Changes are mostly equivalent to Parquet changes in #11018. Store the `columns` option as `optional`:

- `nullopt` when columns are not passed by caller - read all columns.
- Empty vector when caller explicitly passes an empty list/vector - return empty dataframe.
- Vector of column names - read columns with given names.

Also includes a small cleanup of the code equivalent in the Parquet reader.

Fixes #11021

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- MithunR (https://github.com/mythrocks)
- Nghia Truong (https://github.com/ttnghia)

URL: #11446

vuule added 3 commits June 1, 2022 11:04

change column type to optional to disambiguate between empty list and no list

f6cc58e

C++ test

fd07994

copyright year

acd0e29

vuule added the bug, cuIO, and breaking labels Jun 1, 2022

vuule self-assigned this Jun 1, 2022

vuule added this to PR-WIP in v22.08 Release via automation Jun 1, 2022

github-actions bot added the Python and libcudf labels Jun 1, 2022

test fix

3afba35

galipremsagar requested changes Jun 1, 2022

python/cudf/cudf/_lib/parquet.pyx

v22.08 Release automation moved this from PR-WIP to PR-Needs review Jun 1, 2022

galipremsagar mentioned this pull request Jun 1, 2022

[BUG] Empty columns is not returning empty dataframe in read_parquet #8668

Closed

andygrove and others added 4 commits June 15, 2022 08:50

fix java tests

40f01d1

python side changes pt1

42bb0e9

Co-authored-by: GALI PREM SAGAR <sagarprem75@gmail.com>

Merge branch 'bug-pq-reader-empty-col-names' of https://github.com/vuule/cudf into branch-22.08

773c10a

python changes pt2; test

5b678d3

vuule requested a review from galipremsagar June 15, 2022 22:13

github-actions bot added the CMake, conda, and Java labels Jun 15, 2022

vuule changed the base branch from branch-22.08 to branch-22.06 June 15, 2022 22:14

vuule changed the base branch from branch-22.06 to branch-22.08 June 15, 2022 22:14

Merge branch 'bug-pq-reader-empty-col-names-java' of https://github.com/andygrove/cudf into bug-pq-reader-empty-col-names

e7afe2f

github-actions bot removed the CMake and gpuCI labels Jun 15, 2022

vuule requested review from a team as code owners June 16, 2022 14:19

vuule requested review from skirui-source, trxcllnt and devavret June 16, 2022 14:19

galipremsagar approved these changes Jun 16, 2022

vuule requested review from PointKernel and jlowe June 16, 2022 19:28

hyperbolic2346 approved these changes Jun 16, 2022

PointKernel approved these changes Jun 16, 2022

devavret approved these changes Jun 16, 2022

jlowe requested changes Jun 16, 2022

jlowe reviewed Jun 16, 2022

only set columns on builder if there are columns to set

ed6cb9a

vuule requested a review from jlowe June 17, 2022 15:09

jlowe approved these changes Jun 17, 2022

v22.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jun 17, 2022

style

47f1139

andygrove reviewed Jun 17, 2022

java/src/main/java/ai/rapids/cudf/ColumnFilterOptions.java

jlowe reviewed Jun 17, 2022

java/src/main/java/ai/rapids/cudf/ColumnFilterOptions.java

copyright year

2af0afc

rapids-bot bot merged commit 379faf9 into rapidsai:branch-22.08 Jun 17, 2022

v22.08 Release automation moved this from PR-Reviewer approved to Done Jun 17, 2022

ttnghia mentioned this pull request Jun 20, 2022

Fix compile error due to missing header #11126

Merged

vuule mentioned this pull request Aug 3, 2022

Return empty dataframe when reading an ORC file using empty columns option #11446

Merged

3 tasks

vuule deleted the bug-pq-reader-empty-col-names branch August 10, 2023 03:28
