-
Notifications
You must be signed in to change notification settings - Fork 900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON - Parse mixed types as string in JSON reader #14572
JSON - Parse mixed types as string in JSON reader #14572
Conversation
* java bindings * tests * change default for mixedTypesAsStrings to false for backwards compatibility
I keep getting the following error sometimes while reading in python, and it crashes. (but libcudf read_json in c++ unit tests works).
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @karthikeyann
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as @andygrove otherwise lgtm from Java perspective.
Co-authored-by: Andy Grove <andygrove73@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing comments on the changes to json_tree.cu
. Just a few minor comments/questions.
Nice work! Changes to identify the list/struct ends are shorter than what I had expected, mostly because you managed to reuse the existing utilities where possible 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving for cpp changes.
Did you get a chance to check the performance numbers?
Co-authored-by: Elias Stehle <3958403+elstehle@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments attached. Overall I'm very pleased that this is < 500 lines and the implementation looks pretty good!
I left a couple small comments on naming, otherwise approving. |
cpp/src/io/json/json_tree.cu
Outdated
}); | ||
|
||
// propagate to siblings. | ||
propagate_parent_to_siblings( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming the function as propagate_parent_to_children_except_first
? It would made the nodes to which the parent id is copied clearer from the function name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It propagates from first siblings to other siblings, provided first sibling already has some data.
I renamed it to propagate_first_sibling_to_other
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarifications! Approved.
/merge |
Description
Addresses #14239
This PR adds an option to read mixed types as string columns.
It also adds related functional changes to nested JSON reader (libcudf, cuDF-python, Java).
Details:
mixed_types_as_string
bool in json_reader_optionsreduce_to_column_tree()
(which infers the schema), the list and struct node_range_end are changed to node_begin+1 so that it does not copy entire list/struct strings to host for column names.reinitialize_as_string
reinitializes an initialized column as string.NC_STR
.Checklist