Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add Python layer to the GPU-accelerated JSON reader #1630

Merged
merged 31 commits into from
May 15, 2019

Conversation

vuule
Copy link
Contributor

@vuule vuule commented May 4, 2019

Issue #1546

  • Add the Cython implementation of read_json;
  • Add Python tests to target the features/parameters supported by our implementation;
  • Fix back-end bugs uncovered by new tests:
    • Incorrect behavior when byte range offset is not zero, but the byte range size is zero;
    • Incorrect behavior when using byte range with buffer input;
    • Prevent crashing with invalid input;
  • Add support for setting the data types in dictionary format (including a C++ test for this case);

@vuule vuule marked this pull request as ready for review May 6, 2019 21:41
@vuule vuule requested review from a team as code owners May 6, 2019 21:41
@vuule vuule changed the title [WIP] Add Python layer to the GPU-accelerated JSON reader [REVIEW] Add Python layer to the GPU-accelerated JSON reader May 6, 2019
@mjsamoht mjsamoht added cuIO cuIO issue 3 - Ready for Review Ready for review by team labels May 6, 2019
@mjsamoht mjsamoht requested a review from kkraus14 May 7, 2019 22:57
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor changes, otherwise looks great!

python/cudf/bindings/json.pyx Outdated Show resolved Hide resolved
"""

if lines:
df = cpp_read_json(path_or_buf, dtype, lines, compression, byte_range)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should provide an option to still fall back to the Pandas JSON reader for JSONLines if the user wants. Could we add an engine keyword arg similar to the Parquet / ORC reader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 'engine' parameter with the same semantics as in read_orc and read_parquet. The default is still cuDF, when reading JSON Lines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest either giving a info/warning if the user does not specify lines but uses cudf, or adding that info in the docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't add a warning because 'cudf' is the default engine, so I can't tell if the user specifically requested that engine or if they are just using the default version.
Added info to the docstring.

Copy link
Contributor

@j-ieong j-ieong May 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more the fact that it's confusing when the default engine is cudf but that is only invoked when lines is True.

So by default if a user tries to read a normal JSON file, then the engine is cudf but it's actually using PANDAS backend and it will print the warning CPU warning message...

@vuule vuule requested a review from kkraus14 May 13, 2019 20:11
python/cudf/io/json.py Outdated Show resolved Hide resolved
"""

if lines:
df = cpp_read_json(path_or_buf, dtype, lines, compression, byte_range)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest either giving a info/warning if the user does not specify lines but uses cudf, or adding that info in the docstring.

python/cudf/bindings/json.pyx Show resolved Hide resolved
vuule and others added 4 commits May 13, 2019 14:58
python/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
python/cudf/tests/test_json.py Outdated Show resolved Hide resolved
@j-ieong j-ieong self-requested a review May 14, 2019 04:36
@j-ieong
Copy link
Contributor

j-ieong commented May 14, 2019

Whoops, accidentally requested self for review...

python/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
python/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
python/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
Co-Authored-By: Jaime Ieong <45218324+j-ieong@users.noreply.github.com>
@vuule
Copy link
Contributor Author

vuule commented May 15, 2019

@kkraus14 Updated the PR based on the suggestions, can you please update your review?

@kkraus14
Copy link
Collaborator

@kkraus14 Updated the PR based on the suggestions, can you please update your review?

Approved, looks great!

@kkraus14 kkraus14 merged commit aa7e668 into rapidsai:branch-0.8 May 15, 2019
@vuule vuule deleted the enh-ext-json-lines-python branch June 13, 2019 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team cuIO cuIO issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants