Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON single quote normalization API #14729

Merged
merged 40 commits into from
Jan 24, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Jan 9, 2024

Description

The goal of this PR is to address 10004 by supporting parsing of JSON files containing single quotes for field/value strings. This is a follow-up work to the POC PR 14545

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 9, 2024
@shrshi shrshi added 2 - In Progress Currently a work in progress feature request New feature or request non-breaking Non-breaking change labels Jan 9, 2024
@github-actions github-actions bot added the CMake CMake build issue label Jan 11, 2024
@shrshi shrshi marked this pull request as ready for review January 11, 2024 18:59
@shrshi shrshi requested review from a team as code owners January 11, 2024 18:59
@andygrove
Copy link
Contributor

As discussed with @shrshi offline, we would ideally like to see a new normalizeQuotes option added to https://github.com/rapidsai/cudf/blob/branch-24.02/java/src/main/java/ai/rapids/cudf/JSONOptions.java and then have cuDF internally call the new normalize_quotes function if this option is set.

@shrshi shrshi requested a review from a team as a code owner January 13, 2024 03:53
@shrshi shrshi requested a review from elstehle January 19, 2024 22:34
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just one question

cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great 👍
Just a few minor things that I must have missed in your quote normalization test PR before. Otherwise this looks pretty good to go from my side.

cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/src/io/json/json_quote_normalization.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_quote_normalization.cu Outdated Show resolved Hide resolved
@vuule vuule self-requested a review January 22, 2024 18:51
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great; just a few more things that can be cleaned up.

cpp/src/io/json/json_quote_normalization.cu Outdated Show resolved Hide resolved
cpp/tests/io/json_quote_normalization_test.cpp Outdated Show resolved Hide resolved
@shrshi shrshi removed the 2 - In Progress Currently a work in progress label Jan 22, 2024
cpp/CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake approval

@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024, NVIDIA CORPORATION.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did the copyrights drop the year 2023?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new file in this PR implementing the normalization FST, so I think only year 2024 is included in the copyright.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the java side this looks good to me.

@shrshi
Copy link
Contributor Author

shrshi commented Jan 24, 2024

/merge

@rapids-bot rapids-bot bot merged commit f800f5a into rapidsai:branch-24.02 Jan 24, 2024
68 checks passed
PointKernel pushed a commit to PointKernel/cudf that referenced this pull request Jan 25, 2024
The goal of this PR is to address [10004](rapidsai#10004) by supporting parsing of JSON files containing single quotes for field/value strings. This is a follow-up work to the POC [PR 14545](rapidsai#14545)

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Andy Grove (https://github.com/andygrove)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: rapidsai#14729
rapids-bot bot pushed a commit that referenced this pull request Feb 8, 2024
This PR provides a proof-of-concept for the usage of FST in removing unquoted spaces and tabs in JSON strings. This is a useful feature in the cases where we want to cast a hierarchical JSON object to a string, and overcomes the challenge of processing mixed types using Spark. [#14865](#14865)
The FST assumes that the single quotes in the input data have already been normalized (possibly using [`normalize_single_quotes`](#14729)).

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #14931
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

9 participants