[REVIEW] Add Python layer to the GPU-accelerated JSON reader #1630

vuule · 2019-05-04T05:26:09Z

Issue #1546

Add the Cython implementation of read_json.
Add Python tests targeting the features and parameters supported by our implementation.
Fix back-end bugs uncovered by the new tests:
- Incorrect behavior when the byte range offset is nonzero but the byte range size is zero;
- Incorrect behavior when using a byte range with buffer input;
- Crashes on invalid input.
Add support for specifying the data types in dictionary format (including a C++ test for this case).
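
The byte-range cases listed above can be illustrated with a small pure-Python sketch. It is not the reader's implementation: the helper name `read_json_lines_range` is hypothetical, and the semantics (a record is kept if it starts inside the range, and a size of zero means "read to the end of the input") are assumptions made for the example.

```python
import json
from typing import Any, List


def read_json_lines_range(data: bytes, offset: int = 0, size: int = 0) -> List[Any]:
    """Parse the JSON Lines records that start inside [offset, offset + size).

    Assumed semantics (not confirmed by the PR): size == 0 means
    "everything from offset to the end of the buffer".
    """
    end = len(data) if size == 0 else min(offset + size, len(data))
    records = []
    pos = 0
    for line in data.splitlines(keepends=True):
        start = pos          # byte position where this record begins
        pos += len(line)
        if start < offset or start >= end:
            continue         # record begins outside the requested range
        text = line.strip()
        if text:             # skip blank lines
            records.append(json.loads(text))
    return records


# Buffer input with three 9-byte records starting at bytes 0, 9 and 18.
buf = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
tail = read_json_lines_range(buf, offset=9, size=0)   # nonzero offset, zero size
middle = read_json_lines_range(buf, offset=9, size=9)
```

Here `tail` covers the "nonzero offset, zero size" case and `middle` covers a bounded range over an in-memory buffer.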

cpp/src/io/json/json_reader.cu

kkraus14

Very minor changes, otherwise looks great!

python/cudf/bindings/json.pyx

kkraus14 · 2019-05-13T15:12:30Z

python/cudf/io/json.py

+    """
+
+    if lines:
+        df = cpp_read_json(path_or_buf, dtype, lines, compression, byte_range)


We should provide an option to still fall back to the Pandas JSON reader for JSONLines if the user wants. Could we add an engine keyword arg similar to the Parquet / ORC reader?

Added an 'engine' parameter with the same semantics as in read_orc and read_parquet. The default is still cuDF when reading JSON Lines.

I suggest either giving an info/warning message if the user does not specify lines but uses cudf, or adding that info to the docstring.

I can't add a warning because 'cudf' is the default engine, so I can't tell whether the user specifically requested that engine or is just relying on the default.
Added info to the docstring.

It's more that it's confusing when the default engine is cudf but it is only invoked when lines is True.

So by default, if a user tries to read a normal JSON file, the engine is nominally cudf, but the Pandas backend is actually used and the CPU warning message is printed...
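
The confusion described in this thread can be made concrete with a short sketch of how an 'auto' engine (added later in this PR) might resolve to a concrete parser. The helper name `resolve_engine` is hypothetical, and both the exact rules and the error raised for `engine="cudf"` without `lines=True` are assumptions rather than the merged behavior.

```python
def resolve_engine(engine: str = "auto", lines: bool = False) -> str:
    """Return the concrete parser name for a read_json-style call.

    Hypothetical helper: the selection rules below are assumptions.
    """
    if engine not in ("auto", "cudf", "pandas"):
        raise ValueError(f"unsupported engine: {engine!r}")
    if engine == "auto":
        # Use the GPU parser only for JSON Lines; otherwise fall back to Pandas.
        return "cudf" if lines else "pandas"
    if engine == "cudf" and not lines:
        # Assumed: an explicit request that cannot be honored fails loudly
        # instead of silently switching to the CPU parser.
        raise NotImplementedError("the cudf engine only reads JSON Lines (lines=True)")
    return engine
```

With an explicit 'auto' default, a user who never names an engine is not told they are using cudf when the Pandas parser is doing the work.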

python/cudf/io/json.py

j-ieong · 2019-05-13T21:37:51Z

python/cudf/bindings/json.pyx

python/cudf/utils/ioutils.py

python/cudf/tests/test_json.py

j-ieong · 2019-05-14T04:38:32Z

Whoops, accidentally requested self for review...

python/cudf/utils/ioutils.py

vuule · 2019-05-15T16:50:29Z

@kkraus14 Updated the PR based on the suggestions, can you please update your review?

kkraus14 · 2019-05-15T17:16:37Z

> @kkraus14 Updated the PR based on the suggestions, can you please update your review?

Approved, looks great!

vuule and others added 17 commits May 2, 2019 11:08

Merge branch 'branch-0.8' of https://github.com/rapidsai/cudf into enh-ext-json-lines-python

33b4073

call cudf read_json when reading json lines. Has only basic support, not propagating all parameters correctly yet.

e8cffa3

Add file input and byte range support to Json reader Python layer

83b80d5

JSON reader: Add support for dictionary-like dtype specification in the C++ API

74ba93f

Add dtype support to JSON reader Python layer. Works for both arrays and dicts

850b825

JSON reader: Fix issues with byte range; expand tests with fixtures

a3f1518

JSON reader: fix the compression test, add checks to other tests.

f41b2c7

Robustify JSON reader wrt garbage input.

11832e4

Merge branch-0.8

aaa2db5

Revert accidental commit

6e2ca29

Update CHANGELOG.md

0a0520f

Merge branch 'branch-0.8' of https://github.com/rapidsai/cudf into enh-ext-json-lines-python

10b1a08

Fix Python style

1ae58ba

Merge branch 'enh-ext-json-lines-python' of https://github.com/vuule/cudf into enh-ext-json-lines-python

783e9c8

Add missing endline at the end of the file

acf615e

JSON reader: Expand checks in basic test; Add Python API docs; change param order to match Pandas.

fc7d102

Merge branch 'enh-ext-json-lines-python' of https://github.com/vuule/cudf into enh-ext-json-lines-python

a11e684

vuule marked this pull request as ready for review May 6, 2019 21:41

vuule requested review from a team as code owners May 6, 2019 21:41

vuule changed the title ~~[WIP] Add Python layer to the GPU-accelerated JSON reader~~ [REVIEW] Add Python layer to the GPU-accelerated JSON reader May 6, 2019

mjsamoht added the cuIO and 3 - Ready for Review labels May 6, 2019

mjsamoht requested a review from kkraus14 May 7, 2019 22:57

mjsamoht reviewed May 10, 2019

cpp/src/io/json/json_reader.cu (resolved)

mjsamoht approved these changes May 10, 2019

kkraus14 requested changes May 13, 2019

merge 0.8

e40911c

vuule and others added 3 commits May 13, 2019 11:47

Merge branch 'branch-0.8' into enh-ext-json-lines-python

5e934e1

replace relative imports with absolute; Add engine parameter to read_json, to manually switch between cudf and pyarrow parsers

5f37cd2

merge remote

06abab2

vuule requested a review from kkraus14 May 13, 2019 20:11

j-ieong suggested changes May 13, 2019

vuule and others added 4 commits May 13, 2019 14:58

Update python/cudf/bindings/json.pyx

5211241

Co-Authored-By: Jaime Ieong <45218324+j-ieong@users.noreply.github.com>

move read_json docs to the correct file; Change default of dtype to True to match the API with pyarrow

4be6a66

Merge branch 'enh-ext-json-lines-python' of https://github.com/vuule/cudf into enh-ext-json-lines-python

bcc8f10

remove trailing whitespace

4592a02

j-ieong approved these changes May 14, 2019

j-ieong suggested changes May 14, 2019

python/cudf/utils/ioutils.py (outdated, resolved)

python/cudf/tests/test_json.py (outdated, resolved)

j-ieong self-requested a review May 14, 2019 04:36

vuule and others added 5 commits May 13, 2019 23:07

Merge branch 'branch-0.8' into enh-ext-json-lines-python

997b4c4

Expand the basic JSON test to cover both engines; Fixed the engine name (pyarrow->pandas).

4b85040

Merge branch-0.8

d2e6a7c

Merge remote

58fb5a6

Add 'auto' engine to read_json() to make the engine selection clearer. Add a test for the selection.

df1ddf2

j-ieong approved these changes May 15, 2019

python/cudf/utils/ioutils.py (outdated, resolved)

python/cudf/utils/ioutils.py (outdated, resolved)

python/cudf/utils/ioutils.py (outdated, resolved)

Apply suggestions from code review

6aa41ed

Co-Authored-By: Jaime Ieong <45218324+j-ieong@users.noreply.github.com>

kkraus14 approved these changes May 15, 2019

kkraus14 merged commit aa7e668 into rapidsai:branch-0.8 May 15, 2019

quasiben mentioned this pull request May 15, 2019

[FEA] Read JSON support for dask_cudf rapidsai/dask-cudf#253

Closed

vuule deleted the enh-ext-json-lines-python branch June 13, 2019 23:34
