JSON - Parse mixed types as string in JSON reader #14572

Merged

Contributor

karthikeyann commented Dec 5, 2023

Description

Addresses #14239

This PR adds an option to read mixed types as string columns.
It also makes the related functional changes to the nested JSON reader (libcudf, cuDF-python, Java).
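
As a rough CPU-side illustration of what the option means (not the libcudf implementation): the sketch below reads JSON Lines with Python's standard json module and, when the flag is set, turns any column that mixes structs, lists, and scalar values into strings. The names read_json_lines and category, and the rule used to decide that a column is mixed, are assumptions made for this example. json.dumps only approximates the string form, whereas the change described here works from node ranges in the original input.

```python
import json


def category(value):
    """Coarse node category, loosely mirroring struct / list / value."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "list"
    return "value"


def read_json_lines(lines, mixed_types_as_string=False):
    """Hypothetical helper: build columns from JSON Lines records."""
    rows = [json.loads(line) for line in lines]

    # Collect column names in first-seen order.
    columns = {}
    for row in rows:
        for key in row:
            columns.setdefault(key, [])

    # Fill every column, using None for missing keys.
    for row in rows:
        for key, col in columns.items():
            col.append(row.get(key))

    if mixed_types_as_string:
        for key, col in columns.items():
            kinds = {category(v) for v in col if v is not None}
            # Assumption: "mixed" means more than one category in a column.
            if len(kinds) > 1:
                columns[key] = [
                    None if v is None else json.dumps(v) for v in col
                ]
    return columns


records = [
    '{"a": 1, "b": 10}',
    '{"a": {"x": 2}, "b": 20}',
    '{"a": [3, 4], "b": 30}',
]
table = read_json_lines(records, mixed_types_as_string=True)
```

In this sketch, column a mixes a scalar, a struct, and a list, so it becomes a string column, while column b keeps its integer values.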

Details:

  • Added a new boolean option, mixed_types_as_string, to json_reader_options.
  • This feature requires two things: finding the end of struct/list nodes, and parsing struct/list types as strings.
  • For struct and list nodes, node_range_end was previously set to node_range_begin+1, since it was not used anywhere. It is now calculated properly: only the struct and list tokens are copied, and their node_range_end is computed from them. Since an end token is a child of its begin token, scattering the end token's index to the node_range_end of the node corresponding to its parent token yields the node_range_end of list and struct nodes.
  • In reduce_to_column_tree() (which infers the schema), the list and struct node_range_end values are changed to node_begin+1 so that entire list/struct strings are not copied to the host for column names.
  • reinitialize_as_string reinitializes an already-initialized column as a string column.
  • Mixed-type columns are parsed as strings because their column category is changed to NC_STR.
  • Added tests
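
The node_range_end idea in the list above can be sketched sequentially on the CPU. This is only an illustration under assumptions: the function name struct_list_ranges is hypothetical, a Python stack stands in for the token tree, and libcudf performs the equivalent steps in parallel on the GPU. Each begin token starts with the old placeholder end (begin+1), and each end token then scatters its position into the entry of its parent begin token.

```python
def struct_list_ranges(text):
    """Return (begin, end) character ranges of every struct/list in text."""
    begins = []         # positions of '{' and '[' tokens, in document order
    end_tokens = []     # (index of parent begin token, end position)
    stack = []          # open begin tokens; stands in for the token tree
    in_string = False
    escaped = False

    for pos, ch in enumerate(text):
        if in_string:
            # Brackets inside string values must not count as tokens.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(len(begins))
            begins.append(pos)
        elif ch in "}]":
            # The end token's parent is the matching begin token.
            end_tokens.append((stack.pop(), pos + 1))

    # Old behaviour: placeholder end right after the begin character.
    range_end = [begin + 1 for begin in begins]
    # Scatter each end token's position to its parent's range end.
    for parent, end in end_tokens:
        range_end[parent] = end
    return list(zip(begins, range_end))


text = '{"a": [1, {"b": "}"}], "c": {}}'
ranges = struct_list_ranges(text)
substrings = [text[begin:end] for begin, end in ranges]
```

With the full range known, a struct or list node can be sliced out of the input as a string, which is what reading it as a string column requires.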

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann added the feature request, 2 - In Progress, libcudf, cuIO, Spark, and non-breaking labels Dec 5, 2023
@karthikeyann self-assigned this Dec 5, 2023
github-actions bot added the Python label Dec 5, 2023
github-actions bot added the Java label Dec 13, 2023
Contributor Author

karthikeyann commented Dec 14, 2023

I intermittently get the following error while reading in Python, and it crashes (although libcudf read_json works in the C++ unit tests).

Traceback (most recent call last):
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/asyncio/selector_events.py", line 256, in _add_reader
    key = self._selector.get_key(fd)
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/selectors.py", line 193, in get_key
    raise KeyError("{!r} is not registered".format(fileobj)) from None
KeyError: '0 is not registered'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/bin/ipython", line 10, in <module>
    sys.exit(start_ipython())
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/IPython/__init__.py", line 130, in start_ipython
    return launch_new_instance(argv=argv, **kwargs)
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/IPython/terminal/ipapp.py", line 317, in start
    self.shell.mainloop()
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/IPython/terminal/interactiveshell.py", line 911, in mainloop
    self.interact()
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/IPython/terminal/interactiveshell.py", line 896, in interact
    code = self.prompt_for_code()
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/IPython/terminal/interactiveshell.py", line 839, in prompt_for_code
    text = self.pt_app.prompt(
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/prompt_toolkit/shortcuts/prompt.py", line 1026, in prompt
    return self.app.run(
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/prompt_toolkit/application/application.py", line 1021, in run
    return asyncio.run(coro)
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/prompt_toolkit/application/application.py", line 905, in run_async
    return await _run_async(f)
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/prompt_toolkit/application/application.py", line 750, in _run_async
    with self.input.raw_mode(), self.input.attach(
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/site-packages/prompt_toolkit/input/vt100.py", line 165, in _attached_input
    loop.add_reader(fd, callback_wrapper)
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/asyncio/selector_events.py", line 331, in add_reader
    self._add_reader(fd, callback, *args)
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/asyncio/selector_events.py", line 258, in _add_reader
    self._selector.register(fd, selectors.EVENT_READ,
  File "/home/karthikeyan/dev/rapids/compose/etc/conda/cuda_12.0/envs/rapids/lib/python3.10/selectors.py", line 360, in register
    self._selector.register(key.fd, poller_events)
OSError: [Errno 22] Invalid argument

If you suspect this is an IPython 8.20.0 bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

Contributor

@andygrove left a comment

LGTM. Thanks @karthikeyann

Member

@jlowe left a comment

Same comments as @andygrove; otherwise LGTM from the Java perspective.

Contributor

@elstehle left a comment

Flushing comments on the changes to json_tree.cu. Just a few minor comments/questions.

Nice work! Changes to identify the list/struct ends are shorter than what I had expected, mostly because you managed to reuse the existing utilities where possible 👍

Review threads on: cpp/src/io/json/json_tree.cu
Contributor

@elstehle left a comment

Approving for cpp changes.
Did you get a chance to check the performance numbers?

Review threads on: cpp/src/io/json/json_tree.cu, cpp/src/io/json/json_column.cu
karthikeyann and others added 2 commits January 16, 2024 20:13
Co-authored-by: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Contributor

@bdice left a comment

Comments attached. Overall I'm very pleased that this is < 500 lines and the implementation looks pretty good!

Review threads on: cpp/include/cudf/io/json.hpp, cpp/src/io/json/json_column.cu, cpp/src/io/json/json_tree.cu, cpp/tests/io/json_test.cpp, java/src/main/java/ai/rapids/cudf/JSONOptions.java, java/src/test/resources/mixed_types_1.json
Contributor

bdice commented Jan 18, 2024

I left a couple small comments on naming, otherwise approving.

cpp/src/io/json/json_column.cu
});

// propagate to siblings.
propagate_parent_to_siblings(
Contributor

How about renaming the function to propagate_parent_to_children_except_first? That would make it clearer from the function name which nodes the parent id is copied to.

Contributor Author

It propagates from the first sibling to the other siblings, provided the first sibling already has some data.
I renamed it to propagate_first_sibling_to_other.
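
A sequential sketch of the behaviour described in this reply, under stated assumptions: the argument names, the SENTINEL value, and the per-level forward fill are illustrative guesses, and the actual libcudf function runs in parallel and may have a different signature. Here only the first sibling under each parent carries a parent id, and later nodes at the same level inherit the most recent one.

```python
SENTINEL = -1  # hypothetical marker for "parent id not set yet"


def propagate_first_sibling_to_other(node_levels, parent_ids):
    """Copy each first sibling's parent id to the siblings that follow it.

    Nodes are given in document order. Within one tree level, a node whose
    parent id is SENTINEL takes the last parent id seen at that level.
    """
    last_seen = {}  # level -> most recent non-sentinel parent id
    result = []
    for level, parent in zip(node_levels, parent_ids):
        if parent != SENTINEL:
            last_seen[level] = parent
        result.append(last_seen.get(level, SENTINEL))
    return result


# Tree: node 0 is the root; nodes 1 and 3 are its children; node 2 is the
# child of node 1; node 4 is the child of node 3; nodes 5 and 6 are children
# of node 4. Only first siblings (1, 2, 4, 5) start with a parent id.
levels = [0, 1, 2, 1, 2, 3, 3]
parents = [SENTINEL, 0, 1, SENTINEL, 3, 4, SENTINEL]
filled = propagate_first_sibling_to_other(levels, parents)
```

The root keeps the sentinel because no node at its level ever carries a parent id.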

Contributor

@shrshi left a comment

Thank you for the clarifications! Approved.

@karthikeyann
Contributor Author

/merge

Labels: 2 - In Progress, cuIO, feature request, Java, libcudf, non-breaking, Python, Spark
6 participants