[REVIEW] Add Python layer to the GPU-accelerated JSON reader #1630
Very minor changes, otherwise looks great!

    """

    if lines:
        df = cpp_read_json(path_or_buf, dtype, lines, compression, byte_range)
We should provide an option to still fall back to the Pandas JSON reader for JSONLines if the user wants. Could we add an `engine` keyword arg similar to the Parquet / ORC reader?
Added an `engine` parameter with the same semantics as in `read_orc` and `read_parquet`. The default is still cuDF when reading JSON Lines.
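For readers following the thread, the dispatch being discussed can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `read_json_sketch`, `gpu_reader`, and `cpu_reader` are hypothetical names, the two readers are passed in as plain callables so the snippet runs without cuDF or pandas installed, and the warning text is made up.

```python
import warnings


def read_json_sketch(path_or_buf, engine="cudf", lines=False, *,
                     gpu_reader, cpu_reader):
    """Hypothetical dispatcher between a GPU reader and a CPU fallback.

    gpu_reader and cpu_reader are stand-ins (assumptions) for the real
    readers; each takes (path_or_buf, lines=...).
    """
    if engine not in ("cudf", "pandas"):
        raise ValueError(f"Unsupported engine: {engine!r}")

    # The GPU path is only taken for JSON Lines input.
    if engine == "cudf" and lines:
        return gpu_reader(path_or_buf, lines=lines)

    # Everything else falls back to the CPU reader, with a warning.
    warnings.warn("Using CPU via the pandas JSON reader")
    return cpu_reader(path_or_buf, lines=lines)
```

Note that with `engine="cudf"` as the default, a call that leaves `lines` at `False` lands on the CPU path, which is the behaviour questioned in the comments that follow.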
I suggest either giving an info/warning if the user does not specify `lines` but uses `cudf`, or adding that info in the docstring.
I can't add a warning because 'cudf' is the default engine, so I can't tell if the user specifically requested that engine or if they are just using the default version.
Added info to the docstring.
It's more the fact that it's confusing when the default engine is `cudf` but that is only invoked when `lines` is `True`.
So by default, if a user tries to read a normal JSON file, the engine is `cudf` but it's actually using the pandas backend, and it will print the CPU warning message...
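One possible way around this ambiguity, sketched here purely as an illustration and not as what this PR implements, is a sentinel default so that an explicit `engine="cudf"` can be told apart from the default. The `"auto"` value and the `resolve_engine` helper are assumptions introduced for this example.

```python
import warnings


def resolve_engine(engine="auto", lines=False):
    """Return the backend name that would actually run (hypothetical helper)."""
    if engine not in ("auto", "cudf", "pandas"):
        raise ValueError(f"Unsupported engine: {engine!r}")

    if engine == "auto":
        # No explicit request: pick silently based on the input format.
        return "cudf" if lines else "pandas"

    if engine == "cudf" and not lines:
        # Explicit request that cannot be honoured: warn, then fall back.
        warnings.warn(
            "engine='cudf' only supports lines=True; falling back to pandas",
            UserWarning,
        )
        return "pandas"

    return engine
```

With this shape, reading a regular JSON file with default arguments resolves to pandas without any warning, and the warning fires only when the user asked for `cudf` explicitly.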
…json, to manually switch between cudf and pyarrow parsers
Co-Authored-By: Jaime Ieong <45218324+j-ieong@users.noreply.github.com>
…rue to match the API with pyarrow
…cudf into enh-ext-json-lines-python
Whoops, accidentally requested self for review...
…me (pyarrow->pandas).
…. Add a test for the selection.
Co-Authored-By: Jaime Ieong <45218324+j-ieong@users.noreply.github.com>
@kkraus14 Updated the PR based on the suggestions, can you please update your review?
Approved, looks great!
Issue #1546